Implement zombie connection detection and cleanup in ConnectionManager; enhance tests for edge cases

The connection cleanup mechanisms have been significantly improved in v19.5.20:

2. Immediate routing cleanup handler always destroys outgoing connections
3. Tests confirm no accumulation in standard scenarios with reachable backends

However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect.

### Outer Proxy Sudden Accumulation After Hours

**User Report**: "The counter goes up suddenly after some hours on the outer proxy"

**Investigation Findings**:

1. **Cleanup Queue Mechanism**:
- Connections are cleaned up in batches of 100 via a queue
- If the cleanup timer gets stuck or cleared without restart, connections accumulate
- The timer is set with `setTimeout` and could be affected by event loop blocking
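
For context, here is a minimal sketch of how such a batched queue can be structured. The class itself is illustrative; only the `cleanupQueue`/`cleanupTimer` names and the batch size of 100 come from the analysis above:

```typescript
// Illustrative sketch of a batched cleanup queue (not the actual ConnectionManager source)
class CleanupQueueSketch {
  private cleanupQueue = new Set<string>(); // connection IDs awaiting cleanup
  private cleanupTimer: NodeJS.Timeout | null = null;

  public queueCleanup(connectionId: string): void {
    this.cleanupQueue.add(connectionId);
    // Schedule a drain if one is not already pending; if this timer is ever
    // cleared without being rescheduled, the queue grows without bound.
    if (!this.cleanupTimer) {
      this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), 100);
    }
  }

  private processCleanupQueue(): void {
    this.cleanupTimer = null;
    // Drain at most 100 entries per tick, as described above
    const batch = [...this.cleanupQueue].slice(0, 100);
    for (const id of batch) {
      this.cleanupQueue.delete(id);
      // ... destroy sockets and remove the connection record here ...
    }
    // Re-arm for the remainder; forgetting this step is one way the queue gets stuck
    if (this.cleanupQueue.size > 0) {
      this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), 100);
    }
  }
}
```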

2. **Potential Causes for Sudden Spikes**:

a) **Cleanup Timer Failure**:
```typescript
// In ConnectionManager, if this timer gets cleared but not restarted:
this.cleanupTimer = this.setTimeout(() => {
  this.processCleanupQueue();
}, 100);
```

b) **Memory Pressure**:
- After hours of operation, memory fragmentation or pressure could cause delays
- Garbage collection pauses might interfere with timer execution

c) **Event Listener Accumulation**:
- Socket event listeners might accumulate over time
- Server 'connection' event handlers are particularly important
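
One way to check for this is Node's built-in `listenerCount`; a hedged sketch, assuming access to the proxy's underlying `net.Server`:

```typescript
import * as net from 'net';

// Periodically sample listener counts on the server; a steadily growing
// count for the same event suggests handlers are added but never removed.
function watchListenerGrowth(server: net.Server): void {
  setInterval(() => {
    for (const event of ['connection', 'error', 'close']) {
      const count = server.listenerCount(event);
      if (count > 10) {
        console.warn(`Possible listener leak: ${count} '${event}' listeners`);
      }
    }
  }, 60000);
}
```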

d) **Keep-Alive Connection Cascades**:
- Many keep-alive connections can time out simultaneously
- The outer proxy has a different timeout than the inner proxy
- Mass disconnection events can overwhelm the cleanup queue

e) **HttpProxy Component Issues**:
- If using `useHttpProxy`, the HttpProxy bridge might maintain connection pools
- These pools might not be properly cleaned up after hours

3. **Why "Sudden" After Hours**:
- Not a gradual leak but triggered by specific conditions
- Likely related to periodic events or thresholds:
   - Inactivity check runs every 30 seconds
   - Keep-alive connections have extended timeouts (6x normal)
   - Parity check has 30-minute timeout for half-closed connections
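
To see why expiries cluster, the threshold arithmetic can be written out. The constant names and the 60-second base value below are assumptions; only the intervals and multipliers come from the list above:

```typescript
// Illustrative timeout arithmetic (names and base value are assumptions)
const INACTIVITY_CHECK_INTERVAL = 30_000;                 // check runs every 30s
const BASE_INACTIVITY_TIMEOUT = 60_000;                   // hypothetical base timeout
const KEEP_ALIVE_TIMEOUT = BASE_INACTIVITY_TIMEOUT * 6;   // keep-alive gets 6x normal
const HALF_CLOSED_PARITY_TIMEOUT = 30 * 60_000;           // 30 minutes, half-closed

// Connections accepted in the same burst cross KEEP_ALIVE_TIMEOUT within the
// same few 30s check windows, so any cleanup failure surfaces all at once.
```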

4. **Reproduction Scenarios**:
- Mass client disconnection/reconnection (network blip)
- Keep-alive timeout cascade when the inner proxy times out first
- Cleanup timer getting stuck during high load
- Memory pressure causing event loop delays
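
A rough harness for the first scenario might look like this (host, port, and connection count are placeholders):

```typescript
import * as net from 'net';

// Open many connections through the proxy, then destroy them all in the same
// tick to simulate a network blip and stress the cleanup queue.
async function simulateMassDisconnect(host: string, port: number, count = 500): Promise<void> {
  const sockets: net.Socket[] = [];
  await Promise.all(
    Array.from({ length: count }, () =>
      new Promise<void>((resolve) => {
        const socket = net.connect(port, host, () => resolve());
        socket.on('error', () => resolve()); // ignore connect failures in the test
        sockets.push(socket);
      })
    )
  );
  // The proxy's connection counter should return to its previous value
  // shortly afterwards if cleanup is healthy.
  for (const socket of sockets) socket.destroy();
}
```
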
### Additional Monitoring Recommendations

1. **Add Cleanup Queue Monitoring**:
```typescript
// Alert if the queue has backed up but no drain timer is scheduled
setInterval(() => {
  const cm = proxy.connectionManager;
  if (cm.cleanupQueue.size > 100 && !cm.cleanupTimer) {
    logger.error('Cleanup queue stuck!', {
      queueSize: cm.cleanupQueue.size,
      hasTimer: !!cm.cleanupTimer
    });
  }
}, 60000);
```

2. **Track Timer Health**:
- Monitor whether the cleanup timer is running
- Check for event loop blocking
- Log when batch processing takes too long
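
For the event-loop check, a simple lag probe is enough as a sketch (the warning threshold is arbitrary):

```typescript
// Measure how late a 1s timer fires; sustained lag means timers (including
// the cleanup timer) are being delayed by a blocked event loop.
function monitorEventLoopLag(warnAfterMs = 200): void {
  let last = Date.now();
  setInterval(() => {
    const now = Date.now();
    const lag = now - last - 1000;
    if (lag > warnAfterMs) {
      console.warn(`Event loop lag: ${lag}ms`);
    }
    last = now;
  }, 1000);
}
```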

3. **Memory Monitoring**:
- Track heap usage over time
- Monitor for memory leaks in long-running processes
- Force periodic garbage collection if needed
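
A heap-tracking sketch using `process.memoryUsage()` (the warning threshold is a placeholder):

```typescript
// Log heap usage once a minute; warn threshold is a placeholder.
// global.gc is only defined when Node is started with --expose-gc.
function monitorHeap(warnAtBytes = 1_500_000_000): void {
  setInterval(() => {
    const { heapUsed, heapTotal, rss } = process.memoryUsage();
    console.log(`heapUsed=${heapUsed} heapTotal=${heapTotal} rss=${rss}`);
    const maybeGc = (globalThis as { gc?: () => void }).gc;
    if (heapUsed > warnAtBytes && maybeGc) {
      maybeGc(); // force a collection as a last resort
    }
  }, 60000);
}
```
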
### Immediate Mitigations

1. **Restart Cleanup Timer**:
```typescript
// Emergency cleanup timer restart
if (!cm.cleanupTimer && cm.cleanupQueue.size > 0) {
  cm.cleanupTimer = setTimeout(() => {
    cm.processCleanupQueue();
  }, 100);
}
```

2. **Force Periodic Cleanup**:
```typescript
setInterval(() => {
  const cm = connectionManager;
  if (cm.getConnectionCount() > threshold) {
    cm.performOptimizedInactivityCheck();
    // Force process cleanup queue
    cm.processCleanupQueue();
  }
}, 300000); // Every 5 minutes
```

3. **Connection Age Limits**:
- Set a maximum connection lifetime
- Force-close connections older than the threshold
- Apply more aggressive cleanup for proxy chains
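
A sketch of such an age-limit sweep, reusing the `cm` shorthand from the snippets above (the one-hour limit is an assumption):

```typescript
// Sweep for connections older than a maximum age and force-close them.
// `connectionRecords`, `incomingStartTime`, and `cleanupConnection` mirror
// names used elsewhere in this document; the one-hour limit is illustrative.
const MAX_CONNECTION_AGE_MS = 60 * 60 * 1000; // 1 hour; tune per deployment

setInterval(() => {
  const now = Date.now();
  for (const record of cm.connectionRecords.values()) {
    if (!record.connectionClosed && now - record.incomingStartTime > MAX_CONNECTION_AGE_MS) {
      cm.cleanupConnection(record, 'max_age_exceeded');
    }
  }
}, 60000);
```
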
## ✅ FIXED: Zombie Connection Detection (January 2025)

### Root Cause Identified

"Zombie connections" occur when sockets are destroyed without triggering their close/error event handlers. The connection then remains tracked even though both sockets are destroyed, because `connectionClosed` is still `false`. This is particularly problematic in proxy chains, where the inner proxy might close connections in ways that don't trigger proper events on the outer proxy.

### Fix Implemented

Added zombie detection to the periodic inactivity check in ConnectionManager:

```typescript
// In performOptimizedInactivityCheck()
// Check ALL connections for zombie state
for (const [connectionId, record] of this.connectionRecords) {
  if (!record.connectionClosed) {
    const incomingDestroyed = record.incoming?.destroyed || false;
    const outgoingDestroyed = record.outgoing?.destroyed || false;

    // Check for zombie connections: both sockets destroyed but not cleaned up
    if (incomingDestroyed && outgoingDestroyed) {
      logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, {
        connectionId,
        remoteIP: record.remoteIP,
        age: plugins.prettyMs(now - record.incomingStartTime),
        component: 'connection-manager'
      });

      // Clean up immediately
      this.cleanupConnection(record, 'zombie_cleanup');
      continue;
    }

    // Check for half-zombie: one socket destroyed
    if (incomingDestroyed || outgoingDestroyed) {
      const age = now - record.incomingStartTime;
      // Give it 30 seconds grace period for normal cleanup
      if (age > 30000) {
        logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, {
          connectionId,
          remoteIP: record.remoteIP,
          age: plugins.prettyMs(age),
          incomingDestroyed,
          outgoingDestroyed,
          component: 'connection-manager'
        });

        // Clean up
        this.cleanupConnection(record, 'half_zombie_cleanup');
      }
    }
  }
}
```

### How It Works

1. **Full Zombie Detection**: Detects when both incoming and outgoing sockets are destroyed but the connection hasn't been cleaned up
2. **Half-Zombie Detection**: Detects when only one socket is destroyed, with a 30-second grace period for normal cleanup to occur
3. **Automatic Cleanup**: Immediately cleans up zombie connections when detected
4. **Runs Periodically**: Integrated into the existing inactivity check that runs every 30 seconds

### Why This Fixes the Outer Proxy Accumulation
- When the inner proxy closes connections abruptly (e.g., due to backend failure), the outer proxy's outgoing socket might be destroyed without firing close/error events
- These become zombie connections that previously accumulated indefinitely
- Now they are detected and cleaned up within 30 seconds

### Test Results

Debug scripts confirmed:

- Zombie connections can be created when sockets are destroyed directly without events
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled
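
A minimal version of such a debug script, assuming direct access to the ConnectionManager's records (the harness itself is illustrative):

```typescript
// Create a zombie on purpose: destroy both sockets directly so that no
// close/error handlers run, then confirm the zombie check reaps the record.
function createZombie(cm: any, connectionId: string): void {
  const record = cm.connectionRecords.get(connectionId);
  if (!record) return;

  // Remove listeners first so destroy() cannot trigger normal cleanup
  record.incoming?.removeAllListeners();
  record.outgoing?.removeAllListeners();
  record.incoming?.destroy();
  record.outgoing?.destroy();

  // record.connectionClosed is still false here: a full zombie. Within one
  // 30s inactivity check, cleanupConnection(record, 'zombie_cleanup') should
  // run and the record should disappear from connectionRecords.
}
```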

This fix addresses the user's specific request that "connections that are closed on the inner proxy, always also close on the outer proxy."