fix(connection): filter zombie connections part 2

2025-06-07 20:37:49 +00:00
parent 19590ef107
commit 890e907664
4 changed files with 498 additions and 1 deletions
--- a/readme.connections.md
+++ b/readme.connections.md
@@ -548,4 +548,129 @@ Debug scripts confirmed:
 - The zombie detection successfully identifies and cleans up these connections
 - Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled

-This fix addresses the specific issue where "connections that are closed on the inner proxy, always also close on the outer proxy" as requested by the user.
+This fix addresses the specific issue where "connections that are closed on the inner proxy, always also close on the outer proxy" as requested by the user.
+
+## 🔍 Production Diagnostics (January 2025)
+
+The zombie detection fix did not fully resolve the connection accumulation, so use `ProductionConnectionMonitor` to capture the state connections are in when they accumulate:
+
+### How to Use the Production Monitor
+
+1. **Add to your proxy startup script**:
+```typescript
+import ProductionConnectionMonitor from './production-connection-monitor.js';
+
+// After proxy.start()
+const monitor = new ProductionConnectionMonitor(proxy);
+monitor.start(5000); // Check every 5 seconds
+
+// Monitor will automatically capture diagnostics when:
+// - Connections exceed threshold (default: 50)
+// - Sudden spike occurs (default: +20 connections)
+```
+
+2. **Diagnostics are saved to**: `.nogit/connection-diagnostics/`
+
+3. **Force capture anytime**: `monitor.forceCaptureNow()`
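+
+The two automatic capture triggers mentioned in the startup snippet can be sketched as a small pure function. This is an illustration only, not the monitor's actual code: `shouldCapture`, `CaptureThresholds`, and the field names are hypothetical, and treating a spike as "grew by at least 20 since the previous check" is an assumption. Only the default values (50 and +20) come from this README.
+
+```typescript
+// Hypothetical names; only the default values (50, +20) come from this README.
+interface CaptureThresholds {
+  maxConnections: number; // capture when the connection count exceeds this
+  spikeDelta: number;     // capture when the count grows by at least this much between checks
+}
+
+const defaultThresholds: CaptureThresholds = { maxConnections: 50, spikeDelta: 20 };
+
+// Returns the reason a capture should run, or null when nothing looks unusual.
+function shouldCapture(
+  previousCount: number,
+  currentCount: number,
+  thresholds: CaptureThresholds = defaultThresholds,
+): 'threshold' | 'spike' | null {
+  if (currentCount > thresholds.maxConnections) return 'threshold';
+  if (currentCount - previousCount >= thresholds.spikeDelta) return 'spike';
+  return null;
+}
+```
+
+With these assumed semantics, going from 10 to 30 connections between two checks counts as a spike, while going from 45 to 50 triggers nothing because 50 does not exceed the threshold.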
+
+### What the Monitor Captures
+
+For each connection:
+- Socket states (destroyed, readable, writable, readyState)
+- Connection flags (closed, keepAlive, TLS status)
+- Data transfer statistics
+- Time since last activity
+- Cleanup queue status
+- Event listener counts
+- Termination reasons
+
+### Pattern Analysis
+
+The monitor automatically identifies:
+- **Zombie connections**: Both sockets destroyed but the connection was not cleaned up
+- **Half-zombies**: Exactly one of the two sockets destroyed
+- **Stuck connecting**: Outgoing socket stuck in the connecting state
+- **No outgoing**: Outgoing socket missing
+- **Keep-alive stuck**: Keep-alive connections with no recent activity
+- **Old connections**: Connections older than 1 hour
+- **No data transfer**: Connections with no bytes transferred
+- **Listener leaks**: Excessive event listeners on a socket
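+
+The pattern list above can be read as a set of independent predicates over a connection snapshot. The sketch below illustrates that idea under stated assumptions: the `ConnectionSnapshot` shape, the pattern labels, and the 30-second and 5-minute cutoffs are hypothetical, while the 1-hour age limit and the listener limit of 20 are taken from this README. The real monitor may use different fields and limits.
+
+```typescript
+// Hypothetical snapshot shape; the real monitor's fields may differ.
+interface SocketSnapshot {
+  destroyed: boolean;
+  connecting: boolean;
+  listenerCount: number;
+}
+
+interface ConnectionSnapshot {
+  incoming: SocketSnapshot;
+  outgoing: SocketSnapshot | null;
+  keepAlive: boolean;
+  ageMs: number;  // time since the connection was accepted
+  idleMs: number; // time since last activity
+  bytesReceived: number;
+  bytesSent: number;
+}
+
+const ONE_HOUR_MS = 60 * 60 * 1000;  // "older than 1 hour" (from this README)
+const LISTENER_LIMIT = 20;           // ">20" listeners (from this README)
+const CONNECT_LIMIT_MS = 30 * 1000;  // assumed cutoff for "stuck connecting"
+const IDLE_LIMIT_MS = 5 * 60 * 1000; // assumed cutoff for "no recent activity"
+
+// Returns every pattern label that applies; a connection can match several.
+function classify(c: ConnectionSnapshot): string[] {
+  const patterns: string[] = [];
+  const sockets = c.outgoing === null ? [c.incoming] : [c.incoming, c.outgoing];
+  const destroyed = sockets.filter((s) => s.destroyed).length;
+
+  if (c.outgoing === null) patterns.push('no-outgoing');
+  if (sockets.length === 2 && destroyed === 2) patterns.push('zombie');
+  if (sockets.length === 2 && destroyed === 1) patterns.push('half-zombie');
+  if (c.outgoing !== null && c.outgoing.connecting && c.ageMs > CONNECT_LIMIT_MS) {
+    patterns.push('stuck-connecting');
+  }
+  if (c.keepAlive && c.idleMs > IDLE_LIMIT_MS) patterns.push('keepalive-stuck');
+  if (c.ageMs > ONE_HOUR_MS) patterns.push('old');
+  if (c.bytesReceived === 0 && c.bytesSent === 0) patterns.push('no-data');
+  if (sockets.some((s) => s.listenerCount > LISTENER_LIMIT)) patterns.push('listener-leak');
+  return patterns;
+}
+```
+
+Returning a list rather than a single label keeps overlapping cases visible, for example a connection that has no outgoing socket and has also transferred no data.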
+
+### Common Accumulation Patterns
+
+1. **Connecting State Stuck**
+   - Outgoing socket shows `connecting: true` indefinitely
+   - Usually means the connection timeout is not firing
+   - Check whether the backend is reachable
+
+2. **Missing Outgoing Socket**
+   - Connection has no outgoing socket but is not marked closed
+   - May indicate that routing failed immediately during setup
+   - Check the error logs from connection setup
+
+3. **Event Listener Accumulation**
+   - High listener counts (>20) on sockets
+   - Indicates that cleanup is not removing all listeners
+   - Can cause memory leaks
+
+4. **Keep-Alive Zombies**
+   - Keep-alive connections that never time out
+   - Check the keepAlive timeout settings
+   - May need more aggressive cleanup
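+
+For the listener accumulation pattern, a quick manual check is to sum the listeners registered on a socket. The sketch below is an assumption about how such a count could be taken, not the monitor's implementation: `ListenerSource`, `totalListeners`, and `hasListenerLeak` are hypothetical names, and whether the >20 limit applies per socket in total or per event is not stated in this README. The two methods in the interface, `eventNames()` and `listenerCount()`, exist on Node.js `EventEmitter`, so a `net.Socket` can be passed in directly.
+
+```typescript
+// Minimal structural type covering the two EventEmitter methods used here.
+interface ListenerSource {
+  eventNames(): Array<string | symbol>;
+  listenerCount(eventName: string | symbol): number;
+}
+
+// Sum the listeners across every event that currently has at least one.
+function totalListeners(source: ListenerSource): number {
+  return source
+    .eventNames()
+    .reduce((sum, name) => sum + source.listenerCount(name), 0);
+}
+
+// Flags a socket whose total listener count exceeds the limit (20 by default).
+function hasListenerLeak(source: ListenerSource, limit: number = 20): boolean {
+  return totalListeners(source) > limit;
+}
+```
+
+Comparing the count before and after a cleanup pass on the same socket shows whether cleanup is actually removing the handlers it registered.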
+
+### Next Steps
+
+1. **Run the monitor in production** during accumulation
+2. **Share the diagnostic files** from `.nogit/connection-diagnostics/`
+3. **Look for patterns** in the captured snapshots
+4. **Check specific connection IDs** that accumulate
+
+The diagnostic files will show exactly what state connections are in when accumulation occurs, allowing targeted fixes for the specific issue.
+
+## ✅ FIXED: Stuck Connection Detection (January 2025)
+
+### Additional Root Cause Found
+Connections to hanging backends (backends that accept the connection but never respond) were not being cleaned up because:
+- Both sockets remain alive (not destroyed)
+- Keep-alive prevents normal timeout
+- No data is sent back to the client despite receiving data
+- These don't qualify as "zombies" since sockets aren't destroyed
+
+### Fix Implemented
+Added stuck connection detection to the periodic inactivity check:
+
+```typescript
+// Check for stuck connections: no data sent back to client
+if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
+  const age = now - record.incomingStartTime;
+  // If connection is older than 60 seconds and no data sent back, likely stuck
+  if (age > 60000) {
+    logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
+      connectionId,
+      remoteIP: record.remoteIP,
+      age: plugins.prettyMs(age),
+      bytesReceived: record.bytesReceived,
+      targetHost: record.targetHost,
+      targetPort: record.targetPort,
+      component: 'connection-manager'
+    });
+    
+    // Clean up
+    this.cleanupConnection(record, 'stuck_no_response');
+  }
+}
+```
+
+### What This Fixes
+- Connections to backends that accept but never respond
+- Proxy chains where inner proxy connects to unresponsive services
+- Scenarios where keep-alive prevents normal timeout mechanisms
+- Connections that receive client data but never send anything back
+
+### Detection Criteria
+- Connection has received bytes from client (`bytesReceived > 0`)
+- No bytes sent back to client (`bytesSent === 0`)
+- Connection is older than 60 seconds
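+- Connection is not marked closed and has an outgoing socket (`!record.connectionClosed && record.outgoing`); both sockets are typically still alive (not destroyed), which is why zombie detection misses these connections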
+
+This complements the zombie detection by handling cases where sockets remain technically alive but the connection is effectively dead.
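+
+The detection criteria can also be expressed as a standalone predicate, which makes the 60-second rule easy to unit test without real sockets. This is a sketch under assumptions rather than the code in the connection manager: `StuckCheckRecord`, `isStuck`, and `hasOutgoing` are hypothetical names that mirror only the fields the check reads.
+
+```typescript
+// Hypothetical subset of the connection record, limited to what the check reads.
+interface StuckCheckRecord {
+  connectionClosed: boolean;
+  hasOutgoing: boolean;      // stands in for a truthy `record.outgoing`
+  bytesReceived: number;     // bytes received from the client
+  bytesSent: number;         // bytes sent back to the client
+  incomingStartTime: number; // epoch milliseconds
+}
+
+const STUCK_AGE_MS = 60000; // 60 seconds, as in the criteria above
+
+// True when the client has sent data, nothing has gone back, and the connection is old enough.
+function isStuck(
+  record: StuckCheckRecord,
+  now: number,
+  maxAgeMs: number = STUCK_AGE_MS,
+): boolean {
+  if (record.connectionClosed || !record.hasOutgoing) return false;
+  if (record.bytesReceived === 0 || record.bytesSent !== 0) return false;
+  return now - record.incomingStartTime > maxAgeMs;
+}
+```
+
+Passing `now` as a parameter instead of calling `Date.now()` inside the function keeps the check deterministic in tests.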