fix(connection): filter zombie connections part 2
This commit is contained in:
@ -548,4 +548,129 @@ Debug scripts confirmed:
|
||||
- The zombie detection successfully identifies and cleans up these connections
|
||||
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled
|
||||
|
||||
This fix addresses the specific issue where "connections that are closed on the inner proxy, always also close on the outer proxy" as requested by the user.
|
||||
This fix addresses the specific issue where "connections that are closed on the inner proxy, always also close on the outer proxy" as requested by the user.
|
||||
|
||||
## 🔍 Production Diagnostics (January 2025)
|
||||
|
||||
Since the zombie detection fix didn't fully resolve the issue, use the ProductionConnectionMonitor to diagnose the actual problem:
|
||||
|
||||
### How to Use the Production Monitor
|
||||
|
||||
1. **Add to your proxy startup script**:
|
||||
```typescript
|
||||
import ProductionConnectionMonitor from './production-connection-monitor.js';
|
||||
|
||||
// After proxy.start()
|
||||
const monitor = new ProductionConnectionMonitor(proxy);
|
||||
monitor.start(5000); // Check every 5 seconds
|
||||
|
||||
// Monitor will automatically capture diagnostics when:
|
||||
// - Connections exceed threshold (default: 50)
|
||||
// - Sudden spike occurs (default: +20 connections)
|
||||
```
|
||||
|
||||
2. **Diagnostics are saved to**: `.nogit/connection-diagnostics/`
|
||||
|
||||
3. **Force capture anytime**: `monitor.forceCaptureNow()`
|
||||
|
||||
### What the Monitor Captures
|
||||
|
||||
For each connection:
|
||||
- Socket states (destroyed, readable, writable, readyState)
|
||||
- Connection flags (closed, keepAlive, TLS status)
|
||||
- Data transfer statistics
|
||||
- Time since last activity
|
||||
- Cleanup queue status
|
||||
- Event listener counts
|
||||
- Termination reasons
|
||||
|
||||
### Pattern Analysis
|
||||
|
||||
The monitor automatically identifies:
|
||||
- **Zombie connections**: Both sockets destroyed but not cleaned up
|
||||
- **Half-zombies**: One socket destroyed
|
||||
- **Stuck connecting**: Outgoing socket stuck in connecting state
|
||||
- **No outgoing**: Missing outgoing socket
|
||||
- **Keep-alive stuck**: Keep-alive connections with no recent activity
|
||||
- **Old connections**: Connections older than 1 hour
|
||||
- **No data transfer**: Connections with no bytes transferred
|
||||
- **Listener leaks**: Excessive event listeners
|
||||
|
||||
### Common Accumulation Patterns
|
||||
|
||||
1. **Connecting State Stuck**
|
||||
- Outgoing socket shows `connecting: true` indefinitely
|
||||
- Usually means connection timeout not working
|
||||
- Check if backend is reachable
|
||||
|
||||
2. **Missing Outgoing Socket**
|
||||
- Connection has no outgoing socket but isn't closed
|
||||
- May indicate immediate routing issues
|
||||
- Check error logs during connection setup
|
||||
|
||||
3. **Event Listener Accumulation**
|
||||
- High listener counts (>20) on sockets
|
||||
- Indicates cleanup not removing all listeners
|
||||
- Can cause memory leaks
|
||||
|
||||
4. **Keep-Alive Zombies**
|
||||
- Keep-alive connections not timing out
|
||||
- Check keepAlive timeout settings
|
||||
- May need more aggressive cleanup
|
||||
|
||||
### Next Steps
|
||||
|
||||
1. **Run the monitor in production** during accumulation
|
||||
2. **Share the diagnostic files** from `.nogit/connection-diagnostics/`
|
||||
3. **Look for patterns** in the captured snapshots
|
||||
4. **Check specific connection IDs** that accumulate
|
||||
|
||||
The diagnostic files will show exactly what state connections are in when accumulation occurs, allowing targeted fixes for the specific issue.
|
||||
|
||||
## ✅ FIXED: Stuck Connection Detection (January 2025)
|
||||
|
||||
### Additional Root Cause Found
|
||||
Connections to hanging backends (that accept but never respond) were not being cleaned up because:
|
||||
- Both sockets remain alive (not destroyed)
|
||||
- Keep-alive prevents normal timeout
|
||||
- No data is sent back to the client despite receiving data
|
||||
- These don't qualify as "zombies" since sockets aren't destroyed
|
||||
|
||||
### Fix Implemented
|
||||
Added stuck connection detection to the periodic inactivity check:
|
||||
|
||||
```typescript
|
||||
// Check for stuck connections: no data sent back to client
|
||||
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
|
||||
const age = now - record.incomingStartTime;
|
||||
// If connection is older than 60 seconds and no data sent back, likely stuck
|
||||
if (age > 60000) {
|
||||
logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
|
||||
connectionId,
|
||||
remoteIP: record.remoteIP,
|
||||
age: plugins.prettyMs(age),
|
||||
bytesReceived: record.bytesReceived,
|
||||
targetHost: record.targetHost,
|
||||
targetPort: record.targetPort,
|
||||
component: 'connection-manager'
|
||||
});
|
||||
|
||||
// Clean up
|
||||
this.cleanupConnection(record, 'stuck_no_response');
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### What This Fixes
|
||||
- Connections to backends that accept but never respond
|
||||
- Proxy chains where inner proxy connects to unresponsive services
|
||||
- Scenarios where keep-alive prevents normal timeout mechanisms
|
||||
- Connections that receive client data but never send anything back
|
||||
|
||||
### Detection Criteria
|
||||
- Connection has received bytes from client (`bytesReceived > 0`)
|
||||
- No bytes sent back to client (`bytesSent === 0`)
|
||||
- Connection is older than 60 seconds
|
||||
- Both sockets are still alive (not destroyed)
|
||||
|
||||
This complements the zombie detection by handling cases where sockets remain technically alive but the connection is effectively dead.
|
Reference in New Issue
Block a user