fix(readme): update
# Connection Management in SmartProxy

This document describes connection handling, cleanup mechanisms, and known issues in SmartProxy, with a focus on proxy chain configurations.

## Connection Accumulation Investigation (January 2025)

### Problem Statement
Connections may accumulate on the outer proxy in proxy chain configurations, despite the fixes already implemented.

### Historical Context
- **v19.5.12-v19.5.15**: Major connection cleanup improvements
- **v19.5.19+**: PROXY protocol support with WrappedSocket implementation
- **v19.5.20**: Fixed race condition in immediate routing cleanup

### Current Architecture

#### Connection Flow in Proxy Chains
```
Client → Outer Proxy (8001) → Inner Proxy (8002) → Backend (httpbin.org:443)
```

1. **Outer Proxy**:
   - Accepts client connection
   - Sends PROXY protocol header to inner proxy
   - Tracks connection in ConnectionManager
   - Immediate routing for non-TLS ports

2. **Inner Proxy**:
   - Parses PROXY protocol to get real client IP
   - Establishes connection to backend
   - Tracks its own connections separately
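
The hand-off between the two proxies can be sketched as follows. This is a minimal illustration that assumes the human-readable PROXY protocol v1 header; this document does not state which protocol version SmartProxy emits, and `buildProxyV1Header` and `parseProxyV1Header` are hypothetical helper names, not SmartProxy APIs. The sketch also ignores the `PROXY UNKNOWN` form that v1 allows.

```typescript
// Hypothetical helpers, not part of SmartProxy. Assumes PROXY protocol v1 (text form).
interface ProxyInfo {
  family: 'TCP4' | 'TCP6';
  sourceIP: string;
  destinationIP: string;
  sourcePort: number;
  destinationPort: number;
}

// Outer proxy side: build the header that is written before any client payload.
function buildProxyV1Header(info: ProxyInfo): string {
  return `PROXY ${info.family} ${info.sourceIP} ${info.destinationIP} ${info.sourcePort} ${info.destinationPort}\r\n`;
}

// Inner proxy side: recover the real client address from the first line.
// Returns null if the data does not start with a complete, well-formed header.
function parseProxyV1Header(data: string): { info: ProxyInfo; rest: string } | null {
  const end = data.indexOf('\r\n');
  if (end === -1) return null;
  const parts = data.slice(0, end).split(' ');
  if (parts.length !== 6 || parts[0] !== 'PROXY') return null;
  const family: ProxyInfo['family'] | null =
    parts[1] === 'TCP4' ? 'TCP4' : parts[1] === 'TCP6' ? 'TCP6' : null;
  if (family === null) return null;
  const sourcePort = Number(parts[4]);
  const destinationPort = Number(parts[5]);
  const validPort = (p: number): boolean => Number.isInteger(p) && p >= 0 && p <= 65535;
  if (!validPort(sourcePort) || !validPort(destinationPort)) return null;
  return {
    info: { family, sourceIP: parts[2], destinationIP: parts[3], sourcePort, destinationPort },
    rest: data.slice(end + 2), // payload that follows the header
  };
}
```

Returning `null` for anything that does not parse keeps the failure path explicit, so the inner proxy can close and clean up the connection instead of leaving it tracked.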

### Potential Causes of Connection Accumulation

#### 1. Race Condition in Immediate Routing
When a connection is immediately routed (non-TLS ports), there's a timing window:
```typescript
// route-connection-handler.ts, line ~231
this.routeConnection(socket, record, '', undefined);
// Connection is routed before all setup is complete
```

**Issue**: If the client disconnects during backend connection setup, cleanup may not trigger properly.

#### 2. Outgoing Socket Assignment Timing
Despite the fix in v19.5.20:
```typescript
// Line 1362 in setupDirectConnection
record.outgoing = targetSocket;
```
There's still a window between socket creation and the `connect` event where cleanup might miss the outgoing socket.

#### 3. Batch Cleanup Delays
ConnectionManager uses queued cleanup:
- Batch size: 100 connections
- Batch interval: 100ms
- Under rapid connection/disconnection, the queue might lag
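
To get a feel for how far the queue can lag, the batch parameters above translate into a rough drain time. The helper below is a back-of-the-envelope sketch with a hypothetical name; it assumes exactly one batch is processed per timer tick and that nothing new is queued while draining.

```typescript
// Sketch only: one batch per tick, no new arrivals while draining.
function estimateDrainTimeMs(queued: number, batchSize = 100, batchIntervalMs = 100): number {
  if (queued <= 0) return 0;
  return Math.ceil(queued / batchSize) * batchIntervalMs;
}
```

Under these assumptions, 1,000 queued connections need 10 batches, or about one second, so the batching alone should not explain connections that stay tracked for minutes or hours.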

#### 4. Different Cleanup Paths
Multiple cleanup triggers exist:
- Socket 'close' event
- Socket 'error' event
- Inactivity timeout
- Connection timeout
- Manual cleanup

Not all paths may properly handle proxy chain scenarios.

#### 5. Keep-Alive Connection Handling
Keep-alive connections have special treatment:
- Extended inactivity timeout (6x normal)
- Warning before closure
- May accumulate if backend is unresponsive
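
The effect of the extended timeout is easy to underestimate. The following sketch uses a hypothetical helper name and only the 6x multiplier stated above to show how long a keep-alive connection can stay tracked.

```typescript
// Hypothetical helper; the 6x multiplier is the value stated in this document.
const KEEP_ALIVE_TIMEOUT_MULTIPLIER = 6;

function effectiveInactivityTimeoutMs(baseTimeoutMs: number, hasKeepAlive: boolean): number {
  return hasKeepAlive ? baseTimeoutMs * KEEP_ALIVE_TIMEOUT_MULTIPLIER : baseTimeoutMs;
}
```

With a base inactivity timeout of 5 minutes (300000 ms), a keep-alive connection to an unresponsive backend would be held for 30 minutes before the inactivity check closes it.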

### Observed Symptoms

1. **Outer proxy connection count grows over time**
2. **Inner proxy maintains zero or low connection count**
3. **Connections show as closed in logs but remain in tracking**
4. **Memory usage gradually increases**

### Debug Strategies

#### 1. Enhanced Logging
Add connection state logging at key points:
```typescript
// When outgoing socket is created
logger.log('debug', `Outgoing socket created for ${connectionId}`, {
  hasOutgoing: !!record.outgoing,
  outgoingState: record.outgoing?.readyState
});
```

#### 2. Connection State Inspection
Periodically log detailed connection state:
```typescript
for (const [id, record] of connectionManager.getConnections()) {
  console.log({
    id,
    age: Date.now() - record.incomingStartTime,
    incomingDestroyed: record.incoming.destroyed,
    outgoingDestroyed: record.outgoing?.destroyed,
    hasCleanupTimer: !!record.cleanupTimer
  });
}
```

#### 3. Cleanup Verification
Track cleanup completion:
```typescript
// In cleanupConnection
logger.log('debug', `Cleanup completed for ${record.id}`, {
  recordsRemaining: this.connectionRecords.size
});
```

### Recommendations

1. **Immediate Cleanup for Proxy Chains**
   - Skip batch queue for proxy chain connections
   - Use synchronous cleanup when PROXY protocol is detected

2. **Socket State Validation**
   - Check both `destroyed` and `readyState` before cleanup decisions
   - Handle 'opening' state sockets explicitly

3. **Timeout Adjustments**
   - Shorter timeouts for proxy chain connections
   - More aggressive cleanup for connections without data transfer

4. **Connection Limits**
   - Per-route connection limits
   - Backpressure when approaching limits

5. **Monitoring**
   - Export connection metrics
   - Alert on connection count thresholds
   - Track connection age distribution
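
The socket state validation recommended above could look roughly like this. It is a sketch rather than SmartProxy code: `SocketLike` is a structural subset of Node's `net.Socket` (so the example runs without opening sockets), and `classifySocket` and the phase names are hypothetical.

```typescript
// Structural subset of net.Socket so the sketch runs without real sockets.
interface SocketLike {
  destroyed: boolean;
  connecting: boolean;
  // Node.js reports 'opening', 'open', 'readOnly', 'writeOnly' or 'closed'.
  readyState: string;
}

type SocketPhase = 'missing' | 'destroyed' | 'connecting' | 'open' | 'half-open' | 'closed';

function classifySocket(socket: SocketLike | null | undefined): SocketPhase {
  if (!socket) return 'missing';
  // Check `destroyed` first: a destroyed socket is dead regardless of readyState.
  if (socket.destroyed) return 'destroyed';
  // Handle the 'opening' state explicitly instead of treating it as alive.
  if (socket.connecting || socket.readyState === 'opening') return 'connecting';
  if (socket.readyState === 'open') return 'open';
  if (socket.readyState === 'readOnly' || socket.readyState === 'writeOnly') return 'half-open';
  return 'closed';
}
```

A cleanup decision can then switch on the phase, for example applying a connect deadline to `'connecting'` sockets rather than waiting for an inactivity timeout that only starts after the connection is established.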

### Test Scenarios to Reproduce

1. **Rapid Connect/Disconnect**
   ```bash
   # Create many short-lived connections
   for i in {1..1000}; do
     (echo -n | nc localhost 8001) &
   done
   ```

2. **Slow Backend**
   - Configure inner proxy to connect to unresponsive backend
   - Monitor outer proxy connection count

3. **Mixed Traffic**
   - Combine TLS and non-TLS connections
   - Add keep-alive connections
   - Observe accumulation patterns

### Future Improvements

1. **Connection Pool Isolation**
   - Separate pools for proxy chain vs direct connections
   - Different cleanup strategies per pool

2. **Circuit Breaker**
   - Detect accumulation and trigger aggressive cleanup
   - Temporarily refuse new connections when near limit

3. **Connection State Machine**
   - Explicit states: CONNECTING, ESTABLISHED, CLOSING, CLOSED
   - State transition validation
   - Timeout per state

4. **Metrics Collection**
   - Connection lifecycle events
   - Cleanup success/failure rates
   - Time spent in each state
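
The connection state machine proposed above might be sketched like this. Only the four state names come from this document; the transition table, class name, and per-state timeout check are assumptions made for illustration.

```typescript
type ConnectionState = 'CONNECTING' | 'ESTABLISHED' | 'CLOSING' | 'CLOSED';

// Assumed transition table; anything not listed is rejected.
const ALLOWED_TRANSITIONS: Record<ConnectionState, ConnectionState[]> = {
  CONNECTING: ['ESTABLISHED', 'CLOSING', 'CLOSED'],
  ESTABLISHED: ['CLOSING', 'CLOSED'],
  CLOSING: ['CLOSED'],
  CLOSED: [],
};

class ConnectionStateMachine {
  private current: ConnectionState = 'CONNECTING';
  private enteredAt: number;
  private readonly now: () => number;

  // The clock is injectable so the timeout logic can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
    this.enteredAt = now();
  }

  get state(): ConnectionState {
    return this.current;
  }

  // State transition validation: throws on transitions that are not allowed.
  transition(next: ConnectionState): void {
    if (ALLOWED_TRANSITIONS[this.current].indexOf(next) === -1) {
      throw new Error(`Invalid transition ${this.current} -> ${next}`);
    }
    this.current = next;
    this.enteredAt = this.now();
  }

  // Timeout per state: true when the current state has lasted longer than its limit.
  isOverdue(timeoutsMs: Partial<Record<ConnectionState, number>>): boolean {
    const limit = timeoutsMs[this.current];
    return limit !== undefined && this.now() - this.enteredAt > limit;
  }
}
```

A per-state deadline makes a socket that never leaves `CONNECTING` visible as overdue, even though no inactivity timeout applies to it yet.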

### Root Cause Identified (January 2025)

**The primary issue is on the inner proxy when backends are unreachable:**

When the backend is unreachable (e.g., a non-routable IP like 10.255.255.1):
1. The outgoing socket gets stuck in the "opening" state indefinitely
2. The `createSocketWithErrorHandler` in socket-utils.ts doesn't implement a connection timeout
3. `socket.setTimeout()` only handles inactivity AFTER connection, not during the connect phase
4. Connections accumulate because they never transition to an error state
5. Socket timeout warnings fire but connections are preserved as keep-alive

**Code Issue:**
```typescript
// socket-utils.ts line 275
if (timeout) {
  socket.setTimeout(timeout); // This only handles inactivity, not connection!
}
```

**Required Fix:**

1. Add `connectionTimeout` to ISmartProxyOptions interface:
```typescript
// In interfaces.ts
connectionTimeout?: number; // Timeout for establishing connection (ms), default: 30000 (30s)
```

2. Update `createSocketWithErrorHandler` in socket-utils.ts:
```typescript
export function createSocketWithErrorHandler(options: SafeSocketOptions): plugins.net.Socket {
  const { port, host, onError, onConnect, timeout } = options;

  const socket = new plugins.net.Socket();
  let connected = false;
  let connectionTimeout: NodeJS.Timeout | null = null;

  socket.on('error', (error) => {
    if (connectionTimeout) {
      clearTimeout(connectionTimeout);
      connectionTimeout = null;
    }
    if (onError) onError(error);
  });

  socket.on('connect', () => {
    connected = true;
    if (connectionTimeout) {
      clearTimeout(connectionTimeout);
      connectionTimeout = null;
    }
    if (timeout) socket.setTimeout(timeout); // Set inactivity timeout
    if (onConnect) onConnect();
  });

  // Implement connection establishment timeout
  if (timeout) {
    connectionTimeout = setTimeout(() => {
      if (!connected && !socket.destroyed) {
        const error = new Error(`Connection timeout after ${timeout}ms to ${host}:${port}`);
        (error as any).code = 'ETIMEDOUT';
        socket.destroy();
        if (onError) onError(error);
      }
    }, timeout);
  }

  socket.connect(port, host);
  return socket;
}
```

3. Pass connectionTimeout in route-connection-handler.ts:
```typescript
const targetSocket = createSocketWithErrorHandler({
  port: finalTargetPort,
  host: finalTargetHost,
  timeout: this.settings.connectionTimeout || 30000, // Connection timeout
  onError: (error) => { /* existing */ },
  onConnect: async () => { /* existing */ }
});
```

### Investigation Results (January 2025)

Based on extensive testing with debug scripts:

1. **Normal Operation**: In controlled tests, connections are properly cleaned up:
   - Immediate routing cleanup handler properly destroys outgoing connections
   - Both outer and inner proxies maintain 0 connections after clients disconnect
   - Keep-alive connections are tracked and cleaned up correctly

2. **Potential Edge Cases Not Covered by Tests**:
   - **HTTP/2 Connections**: May have different lifecycle than HTTP/1.1
   - **WebSocket Connections**: Long-lived upgrade connections might persist
   - **Partial TLS Handshakes**: Connections that start TLS but don't complete
   - **PROXY Protocol Parse Failures**: Malformed headers from untrusted sources
   - **Connection Pool Reuse**: HttpProxy component may maintain its own pools

3. **Timing-Sensitive Scenarios**:
   - Client disconnects exactly when `record.outgoing` is being assigned
   - Backend connects but immediately RSTs
   - Proxy chain where middle proxy restarts
   - Multiple rapid reconnects with same source IP/port

4. **Configuration-Specific Issues**:
   - Mixed `sendProxyProtocol` settings in chain
   - Different `keepAlive` settings between proxies
   - Mismatched timeout values
   - Routes with `forwardingEngine: 'nftables'`

### Additional Debug Points

Add these debug logs to identify the specific scenario:

```typescript
// In route-connection-handler.ts setupDirectConnection
logger.log('debug', `Setting outgoing socket for ${connectionId}`, {
  timestamp: Date.now(),
  hasOutgoing: !!record.outgoing,
  socketState: targetSocket.readyState
});

// In connection-manager.ts cleanupConnection
logger.log('debug', `Cleanup attempt for ${record.id}`, {
  alreadyClosed: record.connectionClosed,
  hasIncoming: !!record.incoming,
  hasOutgoing: !!record.outgoing,
  incomingDestroyed: record.incoming?.destroyed,
  outgoingDestroyed: record.outgoing?.destroyed
});
```

### Workarounds

Until the connection timeout fix described above is in place:

1. **Periodic Force Cleanup**:
```typescript
setInterval(() => {
  const connections = connectionManager.getConnections();
  for (const [id, record] of connections) {
    if (record.incoming?.destroyed && !record.connectionClosed) {
      connectionManager.cleanupConnection(record, 'force_cleanup');
    }
  }
}, 60000); // Every minute
```

2. **Connection Age Limit**:
```typescript
// Add max connection age check
const maxAge = 3600000; // 1 hour
if (Date.now() - record.incomingStartTime > maxAge) {
  connectionManager.cleanupConnection(record, 'max_age');
}
```

3. **Aggressive Timeout Settings**:
```typescript
{
  socketTimeout: 60000, // 1 minute
  inactivityTimeout: 300000, // 5 minutes
  connectionCleanupInterval: 30000 // 30 seconds
}
```

### Related Files
- `/ts/proxies/smart-proxy/route-connection-handler.ts` - Main connection handling
- `/ts/proxies/smart-proxy/connection-manager.ts` - Connection tracking and cleanup
- `/ts/core/utils/socket-utils.ts` - Socket cleanup utilities
- `/test/test.proxy-chain-cleanup.node.ts` - Test for connection cleanup
- `/test/test.proxy-chaining-accumulation.node.ts` - Test for accumulation prevention
- `/.nogit/debug/connection-accumulation-debug.ts` - Debug script for connection states
- `/.nogit/debug/connection-accumulation-keepalive.ts` - Keep-alive specific tests
- `/.nogit/debug/connection-accumulation-http.ts` - HTTP traffic through proxy chains

### Summary

**Issue Identified**: Connection accumulation occurs on the **inner proxy** (not outer) when backends are unreachable.

**Root Cause**: The `createSocketWithErrorHandler` function in socket-utils.ts doesn't implement a connection establishment timeout. It only calls `socket.setTimeout()`, which handles inactivity AFTER the connection is established, not during the connect phase.

**Impact**: When connecting to unreachable IPs (e.g., 10.255.255.1), outgoing sockets remain in the "opening" state indefinitely, causing connections to accumulate.

**Fix Required**:
1. Add `connectionTimeout` setting to ISmartProxyOptions
2. Implement proper connection timeout in `createSocketWithErrorHandler`
3. Pass the timeout value from route-connection-handler

**Workaround Until Fixed**: Configure shorter socket timeouts and use the periodic force cleanup suggested above.

The connection cleanup mechanisms have been significantly improved in v19.5.20:
1. Race condition fixed by setting `record.outgoing` before connecting
2. Immediate routing cleanup handler always destroys outgoing connections
3. Tests confirm no accumulation in standard scenarios with reachable backends

However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect.

### Outer Proxy Sudden Accumulation After Hours

**User Report**: "The counter goes up suddenly after some hours on the outer proxy"

**Investigation Findings**:

1. **Cleanup Queue Mechanism**:
   - Connections are cleaned up in batches of 100 via a queue
   - If the cleanup timer gets stuck or cleared without restart, connections accumulate
   - The timer is set with `setTimeout` and could be affected by event loop blocking

2. **Potential Causes for Sudden Spikes**:

   a) **Cleanup Timer Failure**:
   ```typescript
   // In ConnectionManager, if this timer gets cleared but not restarted:
   this.cleanupTimer = this.setTimeout(() => {
     this.processCleanupQueue();
   }, 100);
   ```

   b) **Memory Pressure**:
   - After hours of operation, memory fragmentation or pressure could cause delays
   - Garbage collection pauses might interfere with timer execution

   c) **Event Listener Accumulation**:
   - Socket event listeners might accumulate over time
   - Server 'connection' event handlers are particularly important

   d) **Keep-Alive Connection Cascades**:
   - When many keep-alive connections timeout simultaneously
   - Outer proxy has different timeout than inner proxy
   - Mass disconnection events can overwhelm cleanup queue

   e) **HttpProxy Component Issues**:
   - If using `useHttpProxy`, the HttpProxy bridge might maintain connection pools
   - These pools might not be properly cleaned after hours

3. **Why "Sudden" After Hours**:
   - Not a gradual leak but triggered by specific conditions
   - Likely related to periodic events or thresholds:
     - Inactivity check runs every 30 seconds
     - Keep-alive connections have extended timeouts (6x normal)
     - Parity check has 30-minute timeout for half-closed connections

4. **Reproduction Scenarios**:
   - Mass client disconnection/reconnection (network blip)
   - Keep-alive timeout cascade when inner proxy times out first
   - Cleanup timer getting stuck during high load
   - Memory pressure causing event loop delays

### Additional Monitoring Recommendations

1. **Add Cleanup Queue Monitoring**:
   ```typescript
   setInterval(() => {
     const cm = proxy.connectionManager;
     if (cm.cleanupQueue.size > 100 && !cm.cleanupTimer) {
       logger.error('Cleanup queue stuck!', {
         queueSize: cm.cleanupQueue.size,
         hasTimer: !!cm.cleanupTimer
       });
     }
   }, 60000);
   ```

2. **Track Timer Health**:
   - Monitor if cleanup timer is running
   - Check for event loop blocking
   - Log when batch processing takes too long

3. **Memory Monitoring**:
   - Track heap usage over time
   - Monitor for memory leaks in long-running processes
   - Force periodic garbage collection if needed
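
One way to check for event loop blocking is to measure how late a periodic timer fires. The class below is a sketch with a hypothetical name, not an existing SmartProxy component; the clock is injectable so the logic can be tested without real timers.

```typescript
// Hypothetical helper: measures how late a periodic tick fires.
class TimerDriftMonitor {
  private lastTick: number;
  private readonly intervalMs: number;
  private readonly now: () => number;
  maxLagMs = 0;

  constructor(intervalMs: number, now: () => number = () => Date.now()) {
    this.intervalMs = intervalMs;
    this.now = now;
    this.lastTick = now();
  }

  // Call from the periodic timer callback; returns the lag of this tick in ms.
  tick(): number {
    const current = this.now();
    const lag = Math.max(0, current - this.lastTick - this.intervalMs);
    this.lastTick = current;
    if (lag > this.maxLagMs) this.maxLagMs = lag;
    return lag;
  }
}
```

In practice `tick()` would be called from a `setInterval` callback using the same interval, with a warning logged when the returned lag exceeds a chosen limit. Node.js also ships `perf_hooks.monitorEventLoopDelay()`, which may be a better fit when a delay histogram is wanted.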

### Immediate Mitigations

1. **Restart Cleanup Timer**:
   ```typescript
   // Emergency cleanup timer restart
   if (!cm.cleanupTimer && cm.cleanupQueue.size > 0) {
     cm.cleanupTimer = setTimeout(() => {
       cm.processCleanupQueue();
     }, 100);
   }
   ```

2. **Force Periodic Cleanup**:
   ```typescript
   setInterval(() => {
     const cm = connectionManager;
     if (cm.getConnectionCount() > threshold) {
       cm.performOptimizedInactivityCheck();
       // Force process cleanup queue
       cm.processCleanupQueue();
     }
   }, 300000); // Every 5 minutes
   ```

3. **Connection Age Limits**:
   - Set maximum connection lifetime
   - Force close connections older than threshold
   - More aggressive cleanup for proxy chains

## ✅ FIXED: Zombie Connection Detection (January 2025)

### Root Cause Identified
"Zombie connections" occur when sockets are destroyed without triggering their close/error event handlers. This causes connections to remain tracked with both sockets destroyed but `connectionClosed=false`. This is particularly problematic in proxy chains where the inner proxy might close connections in ways that don't trigger proper events on the outer proxy.

### Fix Implemented
Added zombie detection to the periodic inactivity check in ConnectionManager:

```typescript
// In performOptimizedInactivityCheck()
// Check ALL connections for zombie state
for (const [connectionId, record] of this.connectionRecords) {
  if (!record.connectionClosed) {
    const incomingDestroyed = record.incoming?.destroyed || false;
    const outgoingDestroyed = record.outgoing?.destroyed || false;

    // Check for zombie connections: both sockets destroyed but not cleaned up
    if (incomingDestroyed && outgoingDestroyed) {
      logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, {
        connectionId,
        remoteIP: record.remoteIP,
        age: plugins.prettyMs(now - record.incomingStartTime),
        component: 'connection-manager'
      });

      // Clean up immediately
      this.cleanupConnection(record, 'zombie_cleanup');
      continue;
    }

    // Check for half-zombie: one socket destroyed
    if (incomingDestroyed || outgoingDestroyed) {
      const age = now - record.incomingStartTime;
      // Give it 30 seconds grace period for normal cleanup
      if (age > 30000) {
        logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, {
          connectionId,
          remoteIP: record.remoteIP,
          age: plugins.prettyMs(age),
          incomingDestroyed,
          outgoingDestroyed,
          component: 'connection-manager'
        });

        // Clean up
        this.cleanupConnection(record, 'half_zombie_cleanup');
      }
    }
  }
}
```

### How It Works
1. **Full Zombie Detection**: Detects when both incoming and outgoing sockets are destroyed but the connection hasn't been cleaned up
2. **Half-Zombie Detection**: Detects when only one socket is destroyed, with a 30-second grace period for normal cleanup to occur
3. **Automatic Cleanup**: Immediately cleans up zombie connections when detected
4. **Runs Periodically**: Integrated into the existing inactivity check that runs every 30 seconds
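
The decision logic described in these steps can be isolated into a pure function, which makes it testable without real sockets. The sketch below mirrors the checks in the fix; the record shape is reduced to the fields the check reads, and `classifyZombie` and the verdict names are hypothetical.

```typescript
interface TrackedSocket {
  destroyed: boolean;
}

// Reduced record shape: only the fields the zombie check reads.
interface TrackedRecord {
  incoming?: TrackedSocket;
  outgoing?: TrackedSocket | null;
  incomingStartTime: number;
  connectionClosed: boolean;
}

type ZombieVerdict = 'ok' | 'zombie' | 'half-zombie' | 'half-zombie-in-grace';

function classifyZombie(record: TrackedRecord, now: number, graceMs = 30000): ZombieVerdict {
  if (record.connectionClosed) return 'ok';
  const incomingDestroyed = record.incoming?.destroyed ?? false;
  const outgoingDestroyed = record.outgoing?.destroyed ?? false;
  // Full zombie: both sockets destroyed but the record is still tracked.
  if (incomingDestroyed && outgoingDestroyed) return 'zombie';
  // Half-zombie: one socket destroyed; allow a grace period for normal cleanup.
  if (incomingDestroyed || outgoingDestroyed) {
    return now - record.incomingStartTime > graceMs ? 'half-zombie' : 'half-zombie-in-grace';
  }
  return 'ok';
}
```

As in the fix, the grace period is measured from the connection start time rather than from the moment the socket was destroyed, so any connection older than 30 seconds is cleaned up on the first check that finds one side destroyed.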

### Why This Fixes the Outer Proxy Accumulation
- When the inner proxy closes connections abruptly (e.g., due to backend failure), the outer proxy's outgoing socket might be destroyed without firing close/error events
- These become zombie connections that previously accumulated indefinitely
- Now they are detected and cleaned up within 30 seconds

### Test Results
Debug scripts confirmed:
- Zombie connections can be created when sockets are destroyed directly without events
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled

This fix addresses the user's requirement that connections closed on the inner proxy are always closed on the outer proxy as well.

## 🔍 Production Diagnostics (January 2025)

Since the zombie detection fix didn't fully resolve the issue, use the ProductionConnectionMonitor to diagnose the actual problem:

### How to Use the Production Monitor

1. **Add to your proxy startup script**:
   ```typescript
   import ProductionConnectionMonitor from './production-connection-monitor.js';

   // After proxy.start()
   const monitor = new ProductionConnectionMonitor(proxy);
   monitor.start(5000); // Check every 5 seconds

   // Monitor will automatically capture diagnostics when:
   // - Connections exceed threshold (default: 50)
   // - Sudden spike occurs (default: +20 connections)
   ```

2. **Diagnostics are saved to**: `.nogit/connection-diagnostics/`

3. **Force capture anytime**: `monitor.forceCaptureNow()`
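
The two capture triggers mentioned in the startup comments can be pictured as a small decision function. This is not the monitor's actual implementation, which is not shown in this document; `captureReason` and `CaptureThresholds` are hypothetical names, and treating a spike as "at least +20 since the previous check" is an assumption.

```typescript
interface CaptureThresholds {
  maxConnections: number; // capture when the count exceeds this value
  spikeDelta: number;     // capture when the count grows by at least this much between checks
}

// Defaults taken from the comments in the startup snippet.
const DEFAULT_THRESHOLDS: CaptureThresholds = { maxConnections: 50, spikeDelta: 20 };

function captureReason(
  previousCount: number,
  currentCount: number,
  thresholds: CaptureThresholds = DEFAULT_THRESHOLDS
): 'threshold' | 'spike' | null {
  if (currentCount > thresholds.maxConnections) return 'threshold';
  if (currentCount - previousCount >= thresholds.spikeDelta) return 'spike';
  return null;
}
```

With a 5-second check interval, a jump from 10 to 30 connections between two checks would count as a spike even though the absolute threshold of 50 has not been reached.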

### What the Monitor Captures

For each connection:
- Socket states (destroyed, readable, writable, readyState)
- Connection flags (closed, keepAlive, TLS status)
- Data transfer statistics
- Time since last activity
- Cleanup queue status
- Event listener counts
- Termination reasons

### Pattern Analysis

The monitor automatically identifies:
- **Zombie connections**: Both sockets destroyed but not cleaned up
- **Half-zombies**: One socket destroyed
- **Stuck connecting**: Outgoing socket stuck in connecting state
- **No outgoing**: Missing outgoing socket
- **Keep-alive stuck**: Keep-alive connections with no recent activity
- **Old connections**: Connections older than 1 hour
- **No data transfer**: Connections with no bytes transferred
- **Listener leaks**: Excessive event listeners

### Common Accumulation Patterns

1. **Connecting State Stuck**
   - Outgoing socket shows `connecting: true` indefinitely
   - Usually means connection timeout not working
   - Check if backend is reachable

2. **Missing Outgoing Socket**
   - Connection has no outgoing socket but isn't closed
   - May indicate immediate routing issues
   - Check error logs during connection setup

3. **Event Listener Accumulation**
   - High listener counts (>20) on sockets
   - Indicates cleanup not removing all listeners
   - Can cause memory leaks

4. **Keep-Alive Zombies**
   - Keep-alive connections not timing out
   - Check keepAlive timeout settings
   - May need more aggressive cleanup

### Next Steps

1. **Run the monitor in production** during accumulation
2. **Share the diagnostic files** from `.nogit/connection-diagnostics/`
3. **Look for patterns** in the captured snapshots
4. **Check specific connection IDs** that accumulate

The diagnostic files will show exactly what state connections are in when accumulation occurs, allowing targeted fixes for the specific issue.

## ✅ FIXED: Stuck Connection Detection (January 2025)

### Additional Root Cause Found
Connections to hanging backends (that accept but never respond) were not being cleaned up because:
- Both sockets remain alive (not destroyed)
- Keep-alive prevents normal timeout
- No data is sent back to the client despite receiving data
- These don't qualify as "zombies" since sockets aren't destroyed

### Fix Implemented
Added stuck connection detection to the periodic inactivity check:

```typescript
// Check for stuck connections: no data sent back to client
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
  const age = now - record.incomingStartTime;
  // If connection is older than 60 seconds and no data sent back, likely stuck
  if (age > 60000) {
    logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
      connectionId,
      remoteIP: record.remoteIP,
      age: plugins.prettyMs(age),
      bytesReceived: record.bytesReceived,
      targetHost: record.targetHost,
      targetPort: record.targetPort,
      component: 'connection-manager'
    });

    // Clean up
    this.cleanupConnection(record, 'stuck_no_response');
  }
}
```

### What This Fixes
- Connections to backends that accept but never respond
- Proxy chains where inner proxy connects to unresponsive services
- Scenarios where keep-alive prevents normal timeout mechanisms
- Connections that receive client data but never send anything back

### Detection Criteria
- Connection has received bytes from client (`bytesReceived > 0`)
- No bytes sent back to client (`bytesSent === 0`)
- Connection is older than 60 seconds
- Both sockets are still alive (not destroyed)

This complements the zombie detection by handling cases where sockets remain technically alive but the connection is effectively dead.

## 🚨 CRITICAL FIX: Cleanup Queue Bug (January 2025)

### Critical Bug Found
The cleanup queue had a severe bug that caused connection accumulation when more than 100 connections needed cleanup:

```typescript
// BUG: This cleared the ENTIRE queue after processing only the first batch!
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
this.cleanupQueue.clear(); // ❌ This discarded all connections beyond the first 100!
```

### Fix Implemented
```typescript
// Now only removes the connections being processed
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
for (const connectionId of toCleanup) {
  this.cleanupQueue.delete(connectionId); // ✅ Only remove what we process
  const record = this.connectionRecords.get(connectionId);
  if (record) {
    this.cleanupConnection(record, record.incomingTerminationReason || 'normal');
  }
}
```

### Impact
- **Before**: If 150 connections needed cleanup, only the first 100 would be processed and the remaining 50 would accumulate forever
- **After**: All connections are properly cleaned up in batches
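
The before/after behavior can be reproduced in isolation with a plain `Set`. The sketch below is a standalone simulation, not the ConnectionManager code; all function names are hypothetical and the queue holds dummy connection IDs.

```typescript
// Buggy variant: takes one batch, then clears the whole queue.
function drainBuggy(queue: Set<string>, batchSize: number): string[] {
  const batch = Array.from(queue).slice(0, batchSize);
  queue.clear(); // everything beyond the batch is silently dropped
  return batch;
}

// Fixed variant: removes only the IDs it actually processes.
function drainFixed(queue: Set<string>, batchSize: number): string[] {
  const batch = Array.from(queue).slice(0, batchSize);
  for (const id of batch) queue.delete(id);
  return batch;
}

function makeQueue(count: number): Set<string> {
  const queue = new Set<string>();
  for (let i = 0; i < count; i++) queue.add(`conn-${i}`);
  return queue;
}

// Runs drain passes until the queue is empty and counts the processed connections.
function processedUntilEmpty(
  drain: (queue: Set<string>, batchSize: number) => string[],
  queued: number,
  batchSize = 100
): number {
  const queue = makeQueue(queued);
  let processed = 0;
  while (queue.size > 0) processed += drain(queue, batchSize).length;
  return processed;
}
```

For 150 queued connections, `processedUntilEmpty(drainBuggy, 150)` processes 100 and `processedUntilEmpty(drainFixed, 150)` processes all 150, which matches the impact described above.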

### Additional Improvements

1. **Faster Inactivity Checks**: Reduced from 30s to 10s intervals
   - Zombies and stuck connections are detected 3x faster
   - Reduces the window for accumulation

2. **Duplicate Prevention**: Added check in queueCleanup to prevent processing already-closed connections
   - Prevents unnecessary work
   - Ensures connections are only cleaned up once

### Summary of All Fixes

1. **Connection Timeout** (already documented) - Prevents accumulation when backends are unreachable
2. **Zombie Detection** - Cleans up connections with destroyed sockets
3. **Stuck Connection Detection** - Cleans up connections to hanging backends
4. **Cleanup Queue Bug** - Ensures ALL connections get cleaned up, not just the first 100
5. **Faster Detection** - Reduced check interval from 30s to 10s

These fixes combined should prevent connection accumulation in all known scenarios.