Implement zombie connection detection and cleanup in ConnectionManager; enhance tests for edge cases

The connection cleanup mechanisms have been significantly improved in v19.5.20:

2. Immediate routing cleanup handler always destroys outgoing connections
3. Tests confirm no accumulation in standard scenarios with reachable backends

However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect.

### Outer Proxy Sudden Accumulation After Hours

**User Report**: "The counter goes up suddenly after some hours on the outer proxy"

**Investigation Findings**:

1. **Cleanup Queue Mechanism**:
- Connections are cleaned up in batches of 100 via a queue
- If the cleanup timer gets stuck or cleared without restart, connections accumulate
- The timer is set with `setTimeout` and could be affected by event loop blocking
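
For context, here is a minimal sketch of how such a batched queue can be structured. The class itself is illustrative; only the `cleanupQueue`/`cleanupTimer` names and the batch size of 100 come from the analysis above:

```typescript
// Illustrative sketch of a batched cleanup queue (not the actual ConnectionManager source)
class CleanupQueueSketch {
  private cleanupQueue = new Set<string>(); // connection IDs awaiting cleanup
  private cleanupTimer: NodeJS.Timeout | null = null;

  public queueCleanup(connectionId: string): void {
    this.cleanupQueue.add(connectionId);
    // Schedule a drain if one is not already pending; if this timer is ever
    // cleared without being rescheduled, the queue grows without bound.
    if (!this.cleanupTimer) {
      this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), 100);
    }
  }

  private processCleanupQueue(): void {
    this.cleanupTimer = null;
    // Drain at most 100 entries per tick, as described above
    const batch = [...this.cleanupQueue].slice(0, 100);
    for (const id of batch) {
      this.cleanupQueue.delete(id);
      // ... destroy sockets and remove the connection record here ...
    }
    // Re-arm for the remainder; forgetting this step is one way the queue gets stuck
    if (this.cleanupQueue.size > 0) {
      this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), 100);
    }
  }
}
```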

2. **Potential Causes for Sudden Spikes**:

a) **Cleanup Timer Failure**:
```typescript
// In ConnectionManager, if this timer gets cleared but not restarted:
this.cleanupTimer = this.setTimeout(() => {
  this.processCleanupQueue();
}, 100);
```

b) **Memory Pressure**:
- After hours of operation, memory fragmentation or pressure could cause delays
- Garbage collection pauses might interfere with timer execution

c) **Event Listener Accumulation**:
- Socket event listeners might accumulate over time
- Server 'connection' event handlers are particularly important
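
One way to check for this is Node's built-in `listenerCount`; a hedged sketch, assuming access to the proxy's underlying `net.Server`:

```typescript
import * as net from 'net';

// Periodically sample listener counts on the server; a steadily growing
// count for the same event suggests handlers are added but never removed.
function watchListenerGrowth(server: net.Server): void {
  setInterval(() => {
    for (const event of ['connection', 'error', 'close']) {
      const count = server.listenerCount(event);
      if (count > 10) {
        console.warn(`Possible listener leak: ${count} '${event}' listeners`);
      }
    }
  }, 60000);
}
```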

d) **Keep-Alive Connection Cascades**:
- Many keep-alive connections can time out simultaneously
- The outer proxy has a different timeout than the inner proxy
- Mass disconnection events can overwhelm the cleanup queue

e) **HttpProxy Component Issues**:
- If using `useHttpProxy`, the HttpProxy bridge might maintain connection pools
- These pools might not be properly cleaned up after hours

3. **Why "Sudden" After Hours**:
- Not a gradual leak but triggered by specific conditions
- Likely related to periodic events or thresholds:
   - Inactivity check runs every 30 seconds
   - Keep-alive connections have extended timeouts (6x normal)
   - Parity check has 30-minute timeout for half-closed connections
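
To see why expiries cluster, the threshold arithmetic can be written out. The constant names and the 60-second base value below are assumptions; only the intervals and multipliers come from the list above:

```typescript
// Illustrative timeout arithmetic (names and base value are assumptions)
const INACTIVITY_CHECK_INTERVAL = 30_000;                 // check runs every 30s
const BASE_INACTIVITY_TIMEOUT = 60_000;                   // hypothetical base timeout
const KEEP_ALIVE_TIMEOUT = BASE_INACTIVITY_TIMEOUT * 6;   // keep-alive gets 6x normal
const HALF_CLOSED_PARITY_TIMEOUT = 30 * 60_000;           // 30 minutes, half-closed

// Connections accepted in the same burst cross KEEP_ALIVE_TIMEOUT within the
// same few 30s check windows, so any cleanup failure surfaces all at once.
```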

4. **Reproduction Scenarios**:
- Mass client disconnection/reconnection (network blip)
- Keep-alive timeout cascade when the inner proxy times out first
- Cleanup timer getting stuck during high load
- Memory pressure causing event loop delays
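
A rough harness for the first scenario might look like this (host, port, and connection count are placeholders):

```typescript
import * as net from 'net';

// Open many connections through the proxy, then destroy them all in the same
// tick to simulate a network blip and stress the cleanup queue.
async function simulateMassDisconnect(host: string, port: number, count = 500): Promise<void> {
  const sockets: net.Socket[] = [];
  await Promise.all(
    Array.from({ length: count }, () =>
      new Promise<void>((resolve) => {
        const socket = net.connect(port, host, () => resolve());
        socket.on('error', () => resolve()); // ignore connect failures in the test
        sockets.push(socket);
      })
    )
  );
  // The proxy's connection counter should return to its previous value
  // shortly afterwards if cleanup is healthy.
  for (const socket of sockets) socket.destroy();
}
```
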
### Additional Monitoring Recommendations

1. **Add Cleanup Queue Monitoring**:
```typescript
// Alert if the queue has backed up but no drain timer is scheduled
setInterval(() => {
  const cm = proxy.connectionManager;
  if (cm.cleanupQueue.size > 100 && !cm.cleanupTimer) {
    logger.error('Cleanup queue stuck!', {
      queueSize: cm.cleanupQueue.size,
      hasTimer: !!cm.cleanupTimer
    });
  }
}, 60000);
```

2. **Track Timer Health**:
- Monitor whether the cleanup timer is running
- Check for event loop blocking
- Log when batch processing takes too long
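
For the event-loop check, a simple lag probe is enough as a sketch (the warning threshold is arbitrary):

```typescript
// Measure how late a 1s timer fires; sustained lag means timers (including
// the cleanup timer) are being delayed by a blocked event loop.
function monitorEventLoopLag(warnAfterMs = 200): void {
  let last = Date.now();
  setInterval(() => {
    const now = Date.now();
    const lag = now - last - 1000;
    if (lag > warnAfterMs) {
      console.warn(`Event loop lag: ${lag}ms`);
    }
    last = now;
  }, 1000);
}
```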

3. **Memory Monitoring**:
- Track heap usage over time
- Monitor for memory leaks in long-running processes
- Force periodic garbage collection if needed
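
A heap-tracking sketch using `process.memoryUsage()` (the warning threshold is a placeholder):

```typescript
// Log heap usage once a minute; warn threshold is a placeholder.
// global.gc is only defined when Node is started with --expose-gc.
function monitorHeap(warnAtBytes = 1_500_000_000): void {
  setInterval(() => {
    const { heapUsed, heapTotal, rss } = process.memoryUsage();
    console.log(`heapUsed=${heapUsed} heapTotal=${heapTotal} rss=${rss}`);
    const maybeGc = (globalThis as { gc?: () => void }).gc;
    if (heapUsed > warnAtBytes && maybeGc) {
      maybeGc(); // force a collection as a last resort
    }
  }, 60000);
}
```
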
### Immediate Mitigations

1. **Restart Cleanup Timer**:
```typescript
// Emergency cleanup timer restart
if (!cm.cleanupTimer && cm.cleanupQueue.size > 0) {
  cm.cleanupTimer = setTimeout(() => {
    cm.processCleanupQueue();
  }, 100);
}
```

2. **Force Periodic Cleanup**:
```typescript
setInterval(() => {
  const cm = connectionManager;
  if (cm.getConnectionCount() > threshold) {
    cm.performOptimizedInactivityCheck();
    // Force process cleanup queue
    cm.processCleanupQueue();
  }
}, 300000); // Every 5 minutes
```

3. **Connection Age Limits**:
- Set a maximum connection lifetime
- Force-close connections older than the threshold
- Apply more aggressive cleanup for proxy chains
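
A sketch of such an age-limit sweep, reusing the `cm` shorthand from the snippets above (the one-hour limit is an assumption):

```typescript
// Sweep for connections older than a maximum age and force-close them.
// `connectionRecords`, `incomingStartTime`, and `cleanupConnection` mirror
// names used elsewhere in this document; the one-hour limit is illustrative.
const MAX_CONNECTION_AGE_MS = 60 * 60 * 1000; // 1 hour; tune per deployment

setInterval(() => {
  const now = Date.now();
  for (const record of cm.connectionRecords.values()) {
    if (!record.connectionClosed && now - record.incomingStartTime > MAX_CONNECTION_AGE_MS) {
      cm.cleanupConnection(record, 'max_age_exceeded');
    }
  }
}, 60000);
```
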
## ✅ FIXED: Zombie Connection Detection (January 2025)

### Root Cause Identified

"Zombie connections" occur when sockets are destroyed without triggering their close/error event handlers. The connection then remains tracked even though both sockets are destroyed, because `connectionClosed` is still `false`. This is particularly problematic in proxy chains, where the inner proxy might close connections in ways that don't trigger proper events on the outer proxy.

### Fix Implemented

Added zombie detection to the periodic inactivity check in ConnectionManager:

```typescript
// In performOptimizedInactivityCheck()
// Check ALL connections for zombie state
for (const [connectionId, record] of this.connectionRecords) {
  if (!record.connectionClosed) {
    const incomingDestroyed = record.incoming?.destroyed || false;
    const outgoingDestroyed = record.outgoing?.destroyed || false;

    // Check for zombie connections: both sockets destroyed but not cleaned up
    if (incomingDestroyed && outgoingDestroyed) {
      logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, {
        connectionId,
        remoteIP: record.remoteIP,
        age: plugins.prettyMs(now - record.incomingStartTime),
        component: 'connection-manager'
      });

      // Clean up immediately
      this.cleanupConnection(record, 'zombie_cleanup');
      continue;
    }

    // Check for half-zombie: one socket destroyed
    if (incomingDestroyed || outgoingDestroyed) {
      const age = now - record.incomingStartTime;
      // Give it 30 seconds grace period for normal cleanup
      if (age > 30000) {
        logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, {
          connectionId,
          remoteIP: record.remoteIP,
          age: plugins.prettyMs(age),
          incomingDestroyed,
          outgoingDestroyed,
          component: 'connection-manager'
        });

        // Clean up
        this.cleanupConnection(record, 'half_zombie_cleanup');
      }
    }
  }
}
```

### How It Works

1. **Full Zombie Detection**: Detects when both incoming and outgoing sockets are destroyed but the connection hasn't been cleaned up
2. **Half-Zombie Detection**: Detects when only one socket is destroyed, with a 30-second grace period for normal cleanup to occur
3. **Automatic Cleanup**: Immediately cleans up zombie connections when detected
4. **Runs Periodically**: Integrated into the existing inactivity check that runs every 30 seconds

### Why This Fixes the Outer Proxy Accumulation
- When the inner proxy closes connections abruptly (e.g., due to backend failure), the outer proxy's outgoing socket might be destroyed without firing close/error events
- These become zombie connections that previously accumulated indefinitely
- Now they are detected and cleaned up within 30 seconds

### Test Results

Debug scripts confirmed:

- Zombie connections can be created when sockets are destroyed directly without events
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled
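
A minimal version of such a debug script, assuming direct access to the ConnectionManager's records (the harness itself is illustrative):

```typescript
// Create a zombie on purpose: destroy both sockets directly so that no
// close/error handlers run, then confirm the zombie check reaps the record.
function createZombie(cm: any, connectionId: string): void {
  const record = cm.connectionRecords.get(connectionId);
  if (!record) return;

  // Remove listeners first so destroy() cannot trigger normal cleanup
  record.incoming?.removeAllListeners();
  record.outgoing?.removeAllListeners();
  record.incoming?.destroy();
  record.outgoing?.destroy();

  // record.connectionClosed is still false here: a full zombie. Within one
  // 30s inactivity check, cleanupConnection(record, 'zombie_cleanup') should
  // run and the record should disappear from connectionRecords.
}
```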

This fix addresses the user's specific request that "connections that are closed on the inner proxy, always also close on the outer proxy."