fix(connection): filter zombie connections part 2

Juergen Kunz
2025-06-07 20:37:49 +00:00
parent 19590ef107
commit 890e907664
4 changed files with 498 additions and 1 deletions


@@ -548,4 +548,129 @@ Debug scripts confirmed:
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled
This fix addresses the specific issue where "connections that are closed on the inner proxy, always also close on the outer proxy" as requested by the user.
## 🔍 Production Diagnostics (January 2025)
Since the zombie detection fix didn't fully resolve the issue, use the ProductionConnectionMonitor to diagnose the actual problem:
### How to Use the Production Monitor
1. **Add to your proxy startup script**:
```typescript
import ProductionConnectionMonitor from './production-connection-monitor.js';
// After proxy.start()
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds
// Monitor will automatically capture diagnostics when:
// - Connections exceed threshold (default: 50)
// - Sudden spike occurs (default: +20 connections)
```
2. **Diagnostics are saved to**: `.nogit/connection-diagnostics/`
3. **Force capture anytime**: `monitor.forceCaptureNow()`
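The two automatic capture triggers described in the snippet above can be sketched as a small pure function. This is only an illustration of the documented defaults (threshold 50, spike +20); `shouldCapture`, `CaptureThresholds`, and the exact comparison operators are assumptions, not the monitor's real API.

```typescript
// Hypothetical helper: names and option shape are illustrative only.
interface CaptureThresholds {
  maxConnections: number; // absolute count above which a capture is triggered
  spikeDelta: number;     // growth between two checks that counts as a spike
}

// Defaults taken from the comments in the startup snippet.
const DEFAULT_THRESHOLDS: CaptureThresholds = { maxConnections: 50, spikeDelta: 20 };

// Returns the reason for a capture, or null when nothing should be captured.
// Assumption: "exceed" means strictly greater, and a spike is a rise of at
// least `spikeDelta` connections since the previous check.
function shouldCapture(
  previousCount: number,
  currentCount: number,
  thresholds: CaptureThresholds = DEFAULT_THRESHOLDS,
): 'threshold' | 'spike' | null {
  if (currentCount > thresholds.maxConnections) return 'threshold';
  if (currentCount - previousCount >= thresholds.spikeDelta) return 'spike';
  return null;
}
```

With a 5 second interval, such a check would run once per tick against the count seen on the previous tick.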
### What the Monitor Captures
For each connection:
- Socket states (destroyed, readable, writable, readyState)
- Connection flags (closed, keepAlive, TLS status)
- Data transfer statistics
- Time since last activity
- Cleanup queue status
- Event listener counts
- Termination reasons
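As a rough sketch of how the socket-level part of such a snapshot could be gathered: the properties read below (`destroyed`, `readable`, `writable`, `readyState`, `listenerCount`) exist on Node's `net.Socket`, but `SocketLike`, `SocketSnapshot`, `snapshotSocket`, and the list of tracked events are hypothetical and not taken from the monitor's source.

```typescript
// Structural subset of net.Socket, so the sketch compiles without Node typings.
interface SocketLike {
  destroyed: boolean;
  readable: boolean;
  writable: boolean;
  readyState: string;
  listenerCount(eventName: string): number;
}

// Hypothetical snapshot shape for one socket.
interface SocketSnapshot {
  destroyed: boolean;
  readable: boolean;
  writable: boolean;
  readyState: string;
  listeners: Record<string, number>;
}

// Assumption: these are the events worth counting listeners for.
const TRACKED_EVENTS = ['data', 'end', 'close', 'error', 'timeout'];

// Copies the socket state into a plain object; returns null for a missing socket.
function snapshotSocket(socket: SocketLike | null | undefined): SocketSnapshot | null {
  if (!socket) return null;
  const listeners: Record<string, number> = {};
  for (const eventName of TRACKED_EVENTS) {
    listeners[eventName] = socket.listenerCount(eventName);
  }
  return {
    destroyed: socket.destroyed,
    readable: socket.readable,
    writable: socket.writable,
    readyState: socket.readyState,
    listeners,
  };
}
```

Copying values into a plain object keeps the snapshot serializable and avoids holding a reference to the socket itself.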
### Pattern Analysis
The monitor automatically identifies:
- **Zombie connections**: Both sockets destroyed but not cleaned up
- **Half-zombies**: One socket destroyed
- **Stuck connecting**: Outgoing socket stuck in connecting state
- **No outgoing**: Missing outgoing socket
- **Keep-alive stuck**: Keep-alive connections with no recent activity
- **Old connections**: Connections older than 1 hour
- **No data transfer**: Connections with no bytes transferred
- **Listener leaks**: Excessive event listeners
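The patterns above could be expressed as a classifier over a flattened snapshot, roughly as follows. The 1 hour age limit comes from the list, and the listener limit of 20 from the notes on listener accumulation; `ConnectionSnapshot`, `classifyConnection`, the pattern labels, and the connecting and keep-alive idle limits are assumptions made for this sketch.

```typescript
// Hypothetical flattened view of one captured connection.
interface ConnectionSnapshot {
  incomingDestroyed: boolean;
  outgoingDestroyed: boolean | null; // null means there is no outgoing socket
  outgoingConnecting: boolean;
  keepAlive: boolean;
  ageMs: number;
  idleMs: number;
  bytesReceived: number;
  bytesSent: number;
  maxListenerCount: number; // highest listener count seen on either socket
}

const ONE_HOUR_MS = 60 * 60 * 1000;        // "older than 1 hour"
const LISTENER_LIMIT = 20;                 // "high listener counts (>20)"
const CONNECT_STUCK_MS = 30 * 1000;        // assumption, not from the source
const KEEPALIVE_IDLE_MS = 10 * 60 * 1000;  // assumption, not from the source

// Returns every pattern label that applies to the snapshot.
function classifyConnection(c: ConnectionSnapshot): string[] {
  const patterns: string[] = [];
  if (c.outgoingDestroyed === null) {
    patterns.push('no_outgoing');
  } else if (c.incomingDestroyed && c.outgoingDestroyed) {
    patterns.push('zombie');
  } else if (c.incomingDestroyed !== c.outgoingDestroyed) {
    patterns.push('half_zombie');
  }
  if (c.outgoingConnecting && c.ageMs > CONNECT_STUCK_MS) patterns.push('stuck_connecting');
  if (c.keepAlive && c.idleMs > KEEPALIVE_IDLE_MS) patterns.push('keepalive_stuck');
  if (c.ageMs > ONE_HOUR_MS) patterns.push('old');
  if (c.bytesReceived === 0 && c.bytesSent === 0) patterns.push('no_data');
  if (c.maxListenerCount > LISTENER_LIMIT) patterns.push('listener_leak');
  return patterns;
}
```

A connection can match several patterns at once, which is why the function returns a list and not a single label.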
### Common Accumulation Patterns
1. **Connecting State Stuck**
   - Outgoing socket shows `connecting: true` indefinitely
   - Usually means the connection timeout is not firing
   - Check whether the backend is reachable
2. **Missing Outgoing Socket**
   - Connection has no outgoing socket but is not marked closed
   - May indicate that routing failed immediately, before the outgoing socket was created
   - Check the error logs from connection setup
3. **Event Listener Accumulation**
   - High listener counts (>20) on sockets
   - Indicates that cleanup is not removing all listeners
   - Can cause memory leaks
4. **Keep-Alive Zombies**
   - Keep-alive connections are not timing out
   - Check the keepAlive timeout settings
   - May need more aggressive cleanup
### Next Steps
1. **Run the monitor in production** during accumulation
2. **Share the diagnostic files** from `.nogit/connection-diagnostics/`
3. **Look for patterns** in the captured snapshots
4. **Check specific connection IDs** that accumulate
The diagnostic files will show exactly what state connections are in when accumulation occurs, allowing targeted fixes for the specific issue.
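One way to find the connection IDs that accumulate is to count how many snapshots each ID appears in. The sketch below assumes each snapshot has already been reduced to a list of connection IDs; the on-disk format of the diagnostic files is not described here, so loading them is left out, and `findPersistentConnectionIds` is a hypothetical helper.

```typescript
// Returns the IDs that appear in at least `minAppearances` snapshots, sorted.
// Each inner array holds the connection IDs present in one snapshot.
function findPersistentConnectionIds(snapshots: string[][], minAppearances: number): string[] {
  const counts = new Map<string, number>();
  for (const ids of snapshots) {
    // Deduplicate so an ID listed twice in one snapshot is counted once.
    const seen = new Set<string>();
    for (const id of ids) {
      if (seen.has(id)) continue;
      seen.add(id);
      counts.set(id, (counts.get(id) ?? 0) + 1);
    }
  }
  const persistent: string[] = [];
  counts.forEach((count, id) => {
    if (count >= minAppearances) persistent.push(id);
  });
  return persistent.sort();
}
```

IDs that survive many consecutive snapshots are the ones worth inspecting first.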
## ✅ FIXED: Stuck Connection Detection (January 2025)
### Additional Root Cause Found
Connections to hanging backends (that accept but never respond) were not being cleaned up because:
- Both sockets remain alive (not destroyed)
- Keep-alive prevents normal timeout
- No data is sent back to the client despite receiving data
- These don't qualify as "zombies" since sockets aren't destroyed
### Fix Implemented
Added stuck connection detection to the periodic inactivity check:
```typescript
// Check for stuck connections: no data sent back to client
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
  const age = now - record.incomingStartTime;
  // If connection is older than 60 seconds and no data sent back, likely stuck
  if (age > 60000) {
    logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
      connectionId,
      remoteIP: record.remoteIP,
      age: plugins.prettyMs(age),
      bytesReceived: record.bytesReceived,
      targetHost: record.targetHost,
      targetPort: record.targetPort,
      component: 'connection-manager'
    });
    // Clean up
    this.cleanupConnection(record, 'stuck_no_response');
  }
}
```
### What This Fixes
- Connections to backends that accept but never respond
- Proxy chains where inner proxy connects to unresponsive services
- Scenarios where keep-alive prevents normal timeout mechanisms
- Connections that receive client data but never send anything back
### Detection Criteria
- Connection has received bytes from client (`bytesReceived > 0`)
- No bytes sent back to client (`bytesSent === 0`)
- Connection is older than 60 seconds
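- Connection is not marked closed and has an outgoing socket (`!record.connectionClosed && record.outgoing`); both sockets are still alive (not destroyed)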
This complements the zombie detection by handling cases where sockets remain technically alive but the connection is effectively dead.
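For testing the criteria in isolation, the check can be pulled out into a pure predicate. This is a minimal sketch and not the code in the connection manager: `StuckCheckRecord` and `isStuckConnection` are hypothetical names, and only the field names and the 60 second limit are taken from the snippet above.

```typescript
// Hypothetical subset of the connection record used by the stuck check.
interface StuckCheckRecord {
  connectionClosed: boolean;
  outgoing: object | null; // the outgoing socket, if one exists
  bytesReceived: number;
  bytesSent: number;
  incomingStartTime: number; // epoch milliseconds
}

const STUCK_THRESHOLD_MS = 60000; // 60 seconds, as in the implemented fix

// True when data was received from the client, nothing was sent back,
// and the connection has been open for longer than the threshold.
function isStuckConnection(
  record: StuckCheckRecord,
  now: number,
  thresholdMs: number = STUCK_THRESHOLD_MS,
): boolean {
  if (record.connectionClosed || !record.outgoing) return false;
  if (!(record.bytesReceived > 0 && record.bytesSent === 0)) return false;
  return now - record.incomingStartTime > thresholdMs;
}
```

Passing `now` as a parameter keeps the predicate deterministic, so the age boundary can be unit tested without mocking the clock.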