# Connection Management in SmartProxy
This document describes connection handling, cleanup mechanisms, and known issues in SmartProxy, particularly focusing on proxy chain configurations.
## Connection Accumulation Investigation (January 2025)
### Problem Statement
Connections may accumulate on the outer proxy in proxy chain configurations, despite implemented fixes.
### Historical Context
- **v19.5.12-v19.5.15**: Major connection cleanup improvements
- **v19.5.19+**: PROXY protocol support with WrappedSocket implementation
- **v19.5.20**: Fixed race condition in immediate routing cleanup
### Current Architecture
#### Connection Flow in Proxy Chains
```
Client → Outer Proxy (8001) → Inner Proxy (8002) → Backend (httpbin.org:443)
```
1. **Outer Proxy**:
- Accepts client connection
- Sends PROXY protocol header to inner proxy
- Tracks connection in ConnectionManager
- Immediate routing for non-TLS ports
2. **Inner Proxy**:
- Parses PROXY protocol to get real client IP
- Establishes connection to backend
- Tracks its own connections separately
### Potential Causes of Connection Accumulation
#### 1. Race Condition in Immediate Routing
When a connection is immediately routed (non-TLS ports), there's a timing window:
```typescript
// route-connection-handler.ts, line ~231
this.routeConnection(socket, record, '', undefined);
// Connection is routed before all setup is complete
```
**Issue**: If client disconnects during backend connection setup, cleanup may not trigger properly.
#### 2. Outgoing Socket Assignment Timing
Despite the fix in v19.5.20:
```typescript
// Line 1362 in setupDirectConnection
record.outgoing = targetSocket;
```
There's still a window between socket creation and the `connect` event where cleanup might miss the outgoing socket.
#### 3. Batch Cleanup Delays
ConnectionManager uses queued cleanup (a sketch of the pattern follows the list below):
- Batch size: 100 connections
- Batch interval: 100ms
- Under rapid connection/disconnection, the queue might lag behind
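A minimal sketch of this queued, batched cleanup pattern (class and method names are illustrative, not the actual ConnectionManager API; the batch size and interval mirror the values above):
```typescript
// Illustrative sketch of batched cleanup queueing (not the real ConnectionManager code)
class CleanupQueueSketch {
  private cleanupQueue = new Set<string>();
  private cleanupTimer: NodeJS.Timeout | null = null;
  private readonly cleanupBatchSize = 100; // batch size from the list above
  private readonly cleanupInterval = 100;  // batch interval in ms

  queueCleanup(connectionId: string): void {
    this.cleanupQueue.add(connectionId);
    // Arm the timer only if a batch isn't already pending
    if (!this.cleanupTimer) {
      this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), this.cleanupInterval);
    }
  }

  private processCleanupQueue(): void {
    this.cleanupTimer = null;
    // Take at most one batch; anything beyond it must wait for the next tick
    const batch = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
    for (const id of batch) {
      this.cleanupQueue.delete(id);
      // ... look up the connection record and destroy its sockets here ...
    }
    // Under rapid churn the queue can outgrow the batch, so re-arm the timer
    if (this.cleanupQueue.size > 0) {
      this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), this.cleanupInterval);
    }
  }
}
```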
#### 4. Different Cleanup Paths
Multiple cleanup triggers exist:
- Socket 'close' event
- Socket 'error' event
- Inactivity timeout
- Connection timeout
- Manual cleanup
Not all paths may properly handle proxy chain scenarios.
#### 5. Keep-Alive Connection Handling
Keep-alive connections have special treatment (see the sketch after this list):
- Extended inactivity timeout (6x normal)
- Warning before closure
- May accumulate if backend is unresponsive
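In code, the extended timeout amounts to something like the following (a minimal sketch; only the 6x multiplier is taken from the list above, the field and function names are assumptions):
```typescript
// Sketch: deriving the effective inactivity timeout for keep-alive connections
// (the 6x multiplier comes from the list above; names are assumptions)
const KEEPALIVE_TIMEOUT_MULTIPLIER = 6;

function effectiveInactivityTimeout(record: { hasKeepAlive?: boolean }, baseTimeoutMs: number): number {
  return record.hasKeepAlive ? baseTimeoutMs * KEEPALIVE_TIMEOUT_MULTIPLIER : baseTimeoutMs;
}

// Example: a 5-minute base inactivity timeout becomes 30 minutes for keep-alive connections,
// which is why unresponsive backends can hold these connections open for a long time.
const timeoutMs = effectiveInactivityTimeout({ hasKeepAlive: true }, 300_000); // 1_800_000 ms
```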
### Observed Symptoms
1. **Outer proxy connection count grows over time**
2. **Inner proxy maintains zero or low connection count**
3. **Connections show as closed in logs but remain in tracking**
4. **Memory usage gradually increases**
### Debug Strategies
#### 1. Enhanced Logging
Add connection state logging at key points:
```typescript
// When outgoing socket is created
logger.log('debug', `Outgoing socket created for ${connectionId}`, {
  hasOutgoing: !!record.outgoing,
  outgoingState: record.outgoing?.readyState
});
```
#### 2. Connection State Inspection
Periodically log detailed connection state:
```typescript
for (const [id, record] of connectionManager.getConnections()) {
  console.log({
    id,
    age: Date.now() - record.incomingStartTime,
    incomingDestroyed: record.incoming.destroyed,
    outgoingDestroyed: record.outgoing?.destroyed,
    hasCleanupTimer: !!record.cleanupTimer
  });
}
```
#### 3. Cleanup Verification
Track cleanup completion:
```typescript
// In cleanupConnection
logger.log('debug', `Cleanup completed for ${record.id}`, {
  recordsRemaining: this.connectionRecords.size
});
```
### Recommendations
1. **Immediate Cleanup for Proxy Chains**
- Skip batch queue for proxy chain connections
- Use synchronous cleanup when PROXY protocol is detected
2. **Socket State Validation**
- Check both `destroyed` and `readyState` before cleanup decisions
- Handle 'opening' state sockets explicitly
3. **Timeout Adjustments**
- Shorter timeouts for proxy chain connections
- More aggressive cleanup for connections without data transfer
4. **Connection Limits** (see the sketch after this list)
- Per-route connection limits
- Backpressure when approaching limits
5. **Monitoring**
- Export connection metrics
- Alert on connection count thresholds
- Track connection age distribution
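A sketch of what per-route limits with simple backpressure (recommendation 4) could look like; none of these names are existing SmartProxy options:
```typescript
// Sketch: per-route connection limiting with simple backpressure (all names are assumptions)
interface RouteLimits {
  maxConnections: number; // hard cap per route
  backpressureAt: number; // start pushing back before the cap is hit
}

class RouteConnectionBudget {
  private counts = new Map<string, number>();

  constructor(private limits: Map<string, RouteLimits>) {}

  /** Returns how a new connection on a route should be handled. */
  admit(routeName: string): 'accept' | 'accept-with-backpressure' | 'reject' {
    const limit = this.limits.get(routeName);
    const current = this.counts.get(routeName) ?? 0;
    if (limit && current >= limit.maxConnections) return 'reject';
    this.counts.set(routeName, current + 1);
    return limit && current + 1 >= limit.backpressureAt ? 'accept-with-backpressure' : 'accept';
  }

  /** Call when a connection on the route is cleaned up. */
  release(routeName: string): void {
    const current = this.counts.get(routeName) ?? 0;
    this.counts.set(routeName, Math.max(0, current - 1));
  }
}
```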
### Test Scenarios to Reproduce
1. **Rapid Connect/Disconnect**
```bash
# Create many short-lived connections
for i in {1..1000}; do
  (echo -n | nc localhost 8001) &
done
```
2. **Slow Backend** (a reproduction sketch follows this list)
- Configure inner proxy to connect to unresponsive backend
- Monitor outer proxy connection count
3. **Mixed Traffic**
- Combine TLS and non-TLS connections
- Add keep-alive connections
- Observe accumulation patterns
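For the "Slow Backend" scenario above, an unresponsive backend can be simulated with a plain TCP server that accepts connections but never writes back (the port is arbitrary):
```typescript
// Sketch: a TCP "backend" that accepts connections but never responds,
// useful for reproducing the stuck/accumulation scenarios described in this document.
import * as net from 'net';

const hangingBackend = net.createServer((socket) => {
  // Accept the connection, read (and discard) whatever arrives, but never write back.
  socket.on('data', () => { /* swallow client data */ });
  socket.on('error', () => { /* ignore resets from impatient clients */ });
});

hangingBackend.listen(9999, '127.0.0.1', () => {
  console.log('Hanging backend listening on 127.0.0.1:9999 - point the inner proxy at it');
});
```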
### Future Improvements
1. **Connection Pool Isolation**
- Separate pools for proxy chain vs direct connections
- Different cleanup strategies per pool
2. **Circuit Breaker**
- Detect accumulation and trigger aggressive cleanup
- Temporarily refuse new connections when near limit
3. **Connection State Machine** (sketched after this list)
- Explicit states: CONNECTING, ESTABLISHED, CLOSING, CLOSED
- State transition validation
- Timeout per state
4. **Metrics Collection**
- Connection lifecycle events
- Cleanup success/failure rates
- Time spent in each state
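A sketch of the proposed connection state machine (item 3 above); this is a design illustration, not existing SmartProxy code, and the timeout values are placeholders:
```typescript
// Sketch: explicit connection states with validated transitions and a per-state timeout
type ConnectionState = 'CONNECTING' | 'ESTABLISHED' | 'CLOSING' | 'CLOSED';

// Which transitions are legal from each state
const allowedTransitions: Record<ConnectionState, ConnectionState[]> = {
  CONNECTING: ['ESTABLISHED', 'CLOSING', 'CLOSED'],
  ESTABLISHED: ['CLOSING', 'CLOSED'],
  CLOSING: ['CLOSED'],
  CLOSED: [],
};

// Example per-state timeouts (values are illustrative)
const stateTimeoutMs: Record<ConnectionState, number> = {
  CONNECTING: 30_000,     // connect must finish within 30s
  ESTABLISHED: 3_600_000, // hard cap on connection lifetime
  CLOSING: 10_000,        // teardown should not linger
  CLOSED: 0,
};

function transition(current: ConnectionState, next: ConnectionState): ConnectionState {
  if (!allowedTransitions[current].includes(next)) {
    throw new Error(`Illegal state transition: ${current} -> ${next}`);
  }
  return next;
}
```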
### Root Cause Identified (January 2025)
**The primary issue is on the inner proxy when backends are unreachable:**
When the backend is unreachable (e.g., non-routable IP like 10.255.255.1):
1. The outgoing socket gets stuck in "opening" state indefinitely
2. The `createSocketWithErrorHandler` in socket-utils.ts doesn't implement connection timeout
3. `socket.setTimeout()` only handles inactivity AFTER connection, not during connect phase
4. Connections accumulate because they never transition to error state
5. Socket timeout warnings fire but connections are preserved as keep-alive
**Code Issue:**
```typescript
// socket-utils.ts line 275
if (timeout) {
  socket.setTimeout(timeout); // This only handles inactivity, not connection!
}
```
**Required Fix:**
1. Add `connectionTimeout` to ISmartProxyOptions interface:
```typescript
// In interfaces.ts
connectionTimeout?: number; // Timeout for establishing connection (ms), default: 30000 (30s)
```
2. Update `createSocketWithErrorHandler` in socket-utils.ts:
```typescript
export function createSocketWithErrorHandler(options: SafeSocketOptions): plugins.net.Socket {
  const { port, host, onError, onConnect, timeout } = options;
  const socket = new plugins.net.Socket();
  let connected = false;
  let connectionTimeout: NodeJS.Timeout | null = null;

  socket.on('error', (error) => {
    if (connectionTimeout) {
      clearTimeout(connectionTimeout);
      connectionTimeout = null;
    }
    if (onError) onError(error);
  });

  socket.on('connect', () => {
    connected = true;
    if (connectionTimeout) {
      clearTimeout(connectionTimeout);
      connectionTimeout = null;
    }
    if (timeout) socket.setTimeout(timeout); // Set inactivity timeout
    if (onConnect) onConnect();
  });

  // Implement connection establishment timeout
  if (timeout) {
    connectionTimeout = setTimeout(() => {
      if (!connected && !socket.destroyed) {
        const error = new Error(`Connection timeout after ${timeout}ms to ${host}:${port}`);
        (error as any).code = 'ETIMEDOUT';
        socket.destroy();
        if (onError) onError(error);
      }
    }, timeout);
  }

  socket.connect(port, host);
  return socket;
}
```
3. Pass connectionTimeout in route-connection-handler.ts:
```typescript
const targetSocket = createSocketWithErrorHandler({
  port: finalTargetPort,
  host: finalTargetHost,
  timeout: this.settings.connectionTimeout || 30000, // Connection timeout
  onError: (error) => { /* existing */ },
  onConnect: async () => { /* existing */ }
});
```
### Investigation Results (January 2025)
Based on extensive testing with debug scripts:
1. **Normal Operation**: In controlled tests, connections are properly cleaned up:
- Immediate routing cleanup handler properly destroys outgoing connections
- Both outer and inner proxies maintain 0 connections after clients disconnect
- Keep-alive connections are tracked and cleaned up correctly
2. **Potential Edge Cases Not Covered by Tests**:
- **HTTP/2 Connections**: May have different lifecycle than HTTP/1.1
- **WebSocket Connections**: Long-lived upgrade connections might persist
- **Partial TLS Handshakes**: Connections that start TLS but don't complete
- **PROXY Protocol Parse Failures**: Malformed headers from untrusted sources
- **Connection Pool Reuse**: HttpProxy component may maintain its own pools
3. **Timing-Sensitive Scenarios**:
- Client disconnects exactly when `record.outgoing` is being assigned
- Backend connects but immediately RSTs
- Proxy chain where middle proxy restarts
- Multiple rapid reconnects with same source IP/port
4. **Configuration-Specific Issues**:
- Mixed `sendProxyProtocol` settings in chain
- Different `keepAlive` settings between proxies
- Mismatched timeout values
- Routes with `forwardingEngine: 'nftables'`
### Additional Debug Points
Add these debug logs to identify the specific scenario:
```typescript
// In route-connection-handler.ts setupDirectConnection
logger.log('debug', `Setting outgoing socket for ${connectionId}`, {
  timestamp: Date.now(),
  hasOutgoing: !!record.outgoing,
  socketState: targetSocket.readyState
});

// In connection-manager.ts cleanupConnection
logger.log('debug', `Cleanup attempt for ${record.id}`, {
  alreadyClosed: record.connectionClosed,
  hasIncoming: !!record.incoming,
  hasOutgoing: !!record.outgoing,
  incomingDestroyed: record.incoming?.destroyed,
  outgoingDestroyed: record.outgoing?.destroyed
});
```
### Workarounds
Until root cause is identified:
1. **Periodic Force Cleanup**:
```typescript
setInterval(() => {
  const connections = connectionManager.getConnections();
  for (const [id, record] of connections) {
    if (record.incoming?.destroyed && !record.connectionClosed) {
      connectionManager.cleanupConnection(record, 'force_cleanup');
    }
  }
}, 60000); // Every minute
```
2. **Connection Age Limit**:
```typescript
// Add max connection age check
const maxAge = 3600000; // 1 hour
if (Date.now() - record.incomingStartTime > maxAge) {
  connectionManager.cleanupConnection(record, 'max_age');
}
```
3. **Aggressive Timeout Settings**:
```typescript
{
  socketTimeout: 60000,            // 1 minute
  inactivityTimeout: 300000,       // 5 minutes
  connectionCleanupInterval: 30000 // 30 seconds
}
```
### Related Files
- `/ts/proxies/smart-proxy/route-connection-handler.ts` - Main connection handling
- `/ts/proxies/smart-proxy/connection-manager.ts` - Connection tracking and cleanup
- `/ts/core/utils/socket-utils.ts` - Socket cleanup utilities
- `/test/test.proxy-chain-cleanup.node.ts` - Test for connection cleanup
- `/test/test.proxy-chaining-accumulation.node.ts` - Test for accumulation prevention
- `/.nogit/debug/connection-accumulation-debug.ts` - Debug script for connection states
- `/.nogit/debug/connection-accumulation-keepalive.ts` - Keep-alive specific tests
- `/.nogit/debug/connection-accumulation-http.ts` - HTTP traffic through proxy chains
### Summary
**Issue Identified**: Connection accumulation occurs on the **inner proxy** (not outer) when backends are unreachable.
**Root Cause**: The `createSocketWithErrorHandler` function in socket-utils.ts doesn't implement connection establishment timeout. It only sets `socket.setTimeout()` which handles inactivity AFTER connection is established, not during the connect phase.
**Impact**: When connecting to unreachable IPs (e.g., 10.255.255.1), outgoing sockets remain in "opening" state indefinitely, causing connections to accumulate.
**Fix Required**:
1. Add `connectionTimeout` setting to ISmartProxyOptions
2. Implement proper connection timeout in `createSocketWithErrorHandler`
3. Pass the timeout value from route-connection-handler
**Workaround Until Fixed**: Configure shorter socket timeouts and use the periodic force cleanup suggested above.
The connection cleanup mechanisms have been significantly improved in v19.5.20:
1. Race condition fixed by setting `record.outgoing` before connecting
2. Immediate routing cleanup handler always destroys outgoing connections
3. Tests confirm no accumulation in standard scenarios with reachable backends
However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect.
### Outer Proxy Sudden Accumulation After Hours
**User Report**: "The counter goes up suddenly after some hours on the outer proxy"
**Investigation Findings**:
1. **Cleanup Queue Mechanism**:
- Connections are cleaned up in batches of 100 via a queue
- If the cleanup timer gets stuck or cleared without restart, connections accumulate
- The timer is set with `setTimeout` and could be affected by event loop blocking
2. **Potential Causes for Sudden Spikes**:
a) **Cleanup Timer Failure**:
```typescript
// In ConnectionManager, if this timer gets cleared but not restarted:
this.cleanupTimer = this.setTimeout(() => {
  this.processCleanupQueue();
}, 100);
```
b) **Memory Pressure**:
- After hours of operation, memory fragmentation or pressure could cause delays
- Garbage collection pauses might interfere with timer execution
c) **Event Listener Accumulation**:
- Socket event listeners might accumulate over time
- Server 'connection' event handlers are particularly important
d) **Keep-Alive Connection Cascades**:
- When many keep-alive connections timeout simultaneously
- Outer proxy has different timeout than inner proxy
- Mass disconnection events can overwhelm cleanup queue
e) **HttpProxy Component Issues**:
- If using `useHttpProxy`, the HttpProxy bridge might maintain connection pools
- These pools might not be properly cleaned after hours
3. **Why "Sudden" After Hours**:
- Not a gradual leak but triggered by specific conditions
- Likely related to periodic events or thresholds:
- Inactivity check runs every 30 seconds
- Keep-alive connections have extended timeouts (6x normal)
- Parity check has 30-minute timeout for half-closed connections
4. **Reproduction Scenarios**:
- Mass client disconnection/reconnection (network blip)
- Keep-alive timeout cascade when inner proxy times out first
- Cleanup timer getting stuck during high load
- Memory pressure causing event loop delays
### Additional Monitoring Recommendations
1. **Add Cleanup Queue Monitoring**:
```typescript
setInterval(() => {
  const cm = proxy.connectionManager;
  if (cm.cleanupQueue.size > 100 && !cm.cleanupTimer) {
    logger.error('Cleanup queue stuck!', {
      queueSize: cm.cleanupQueue.size,
      hasTimer: !!cm.cleanupTimer
    });
  }
}, 60000);
```
2. **Track Timer Health** (see the sketch after this list):
- Monitor if cleanup timer is running
- Check for event loop blocking
- Log when batch processing takes too long
3. **Memory Monitoring**:
- Track heap usage over time
- Monitor for memory leaks in long-running processes
- Force periodic garbage collection if needed
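A sketch covering items 2 and 3: detecting event loop lag (which would delay the cleanup timer) and logging heap usage. The thresholds are arbitrary:
```typescript
// Sketch: detect event loop blocking and log heap usage (thresholds are illustrative)
const EXPECTED_INTERVAL_MS = 1000;
const LAG_WARN_MS = 200; // warn if the timer fires this much late

let last = Date.now();
setInterval(() => {
  const now = Date.now();
  const lag = now - last - EXPECTED_INTERVAL_MS;
  last = now;
  if (lag > LAG_WARN_MS) {
    console.warn(`Event loop lag of ${lag}ms detected - cleanup timers may be delayed`);
  }

  // Track heap usage over time to spot slow leaks in long-running processes
  const mem = process.memoryUsage();
  console.log(`heapUsed=${Math.round(mem.heapUsed / 1024 / 1024)}MB rss=${Math.round(mem.rss / 1024 / 1024)}MB`);
}, EXPECTED_INTERVAL_MS);
```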
### Immediate Mitigations
1. **Restart Cleanup Timer**:
```typescript
// Emergency cleanup timer restart
if (!cm.cleanupTimer && cm.cleanupQueue.size > 0) {
  cm.cleanupTimer = setTimeout(() => {
    cm.processCleanupQueue();
  }, 100);
}
```
2. **Force Periodic Cleanup**:
```typescript
setInterval(() => {
  const cm = connectionManager;
  if (cm.getConnectionCount() > threshold) {
    cm.performOptimizedInactivityCheck();
    // Force process cleanup queue
    cm.processCleanupQueue();
  }
}, 300000); // Every 5 minutes
```
3. **Connection Age Limits**:
- Set maximum connection lifetime
- Force close connections older than threshold
- More aggressive cleanup for proxy chains
## ✅ FIXED: Zombie Connection Detection (January 2025)
### Root Cause Identified
"Zombie connections" occur when sockets are destroyed without triggering their close/error event handlers. This causes connections to remain tracked with both sockets destroyed but `connectionClosed=false` . This is particularly problematic in proxy chains where the inner proxy might close connections in ways that don't trigger proper events on the outer proxy.
### Fix Implemented
Added zombie detection to the periodic inactivity check in ConnectionManager:
```typescript
// In performOptimizedInactivityCheck()
// Check ALL connections for zombie state
for (const [connectionId, record] of this.connectionRecords) {
  if (!record.connectionClosed) {
    const incomingDestroyed = record.incoming?.destroyed || false;
    const outgoingDestroyed = record.outgoing?.destroyed || false;

    // Check for zombie connections: both sockets destroyed but not cleaned up
    if (incomingDestroyed && outgoingDestroyed) {
      logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, {
        connectionId,
        remoteIP: record.remoteIP,
        age: plugins.prettyMs(now - record.incomingStartTime),
        component: 'connection-manager'
      });

      // Clean up immediately
      this.cleanupConnection(record, 'zombie_cleanup');
      continue;
    }

    // Check for half-zombie: one socket destroyed
    if (incomingDestroyed || outgoingDestroyed) {
      const age = now - record.incomingStartTime;
      // Give it 30 seconds grace period for normal cleanup
      if (age > 30000) {
        logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, {
          connectionId,
          remoteIP: record.remoteIP,
          age: plugins.prettyMs(age),
          incomingDestroyed,
          outgoingDestroyed,
          component: 'connection-manager'
        });

        // Clean up
        this.cleanupConnection(record, 'half_zombie_cleanup');
      }
    }
  }
}
```
### How It Works
1. **Full Zombie Detection**: Detects when both incoming and outgoing sockets are destroyed but the connection hasn't been cleaned up
2. **Half-Zombie Detection**: Detects when only one socket is destroyed, with a 30-second grace period for normal cleanup to occur
3. **Automatic Cleanup**: Immediately cleans up zombie connections when detected
4. **Runs Periodically**: Integrated into the existing inactivity check that runs every 30 seconds
### Why This Fixes the Outer Proxy Accumulation
- When inner proxy closes connections abruptly (e.g., due to backend failure), the outer proxy's outgoing socket might be destroyed without firing close/error events
- These become zombie connections that previously accumulated indefinitely
- Now they are detected and cleaned up within 30 seconds
### Test Results
Debug scripts confirmed:
- Zombie connections can be created when sockets are destroyed directly without events
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled
This fix addresses the user's request that "connections that are closed on the inner proxy always also close on the outer proxy".
## 🔍 Production Diagnostics (January 2025)
Since the zombie detection fix didn't fully resolve the issue, use the ProductionConnectionMonitor to diagnose the actual problem:
### How to Use the Production Monitor
1. **Add to your proxy startup script**:
```typescript
import ProductionConnectionMonitor from './production-connection-monitor.js';
// After proxy.start()
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds
// Monitor will automatically capture diagnostics when:
// - Connections exceed threshold (default: 50)
// - Sudden spike occurs (default: +20 connections)
```
2. **Diagnostics are saved to**: `.nogit/connection-diagnostics/`
3. **Force capture anytime**: `monitor.forceCaptureNow()`
### What the Monitor Captures
For each connection (a possible snapshot shape is sketched after this list):
- Socket states (destroyed, readable, writable, readyState)
- Connection flags (closed, keepAlive, TLS status)
- Data transfer statistics
- Time since last activity
- Cleanup queue status
- Event listener counts
- Termination reasons
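One possible shape for such a per-connection snapshot (an assumed structure for illustration; the actual ProductionConnectionMonitor output may differ):
```typescript
// Sketch: a per-connection diagnostic snapshot covering the fields listed above
// (field names are assumptions, not the actual ProductionConnectionMonitor output)
interface ConnectionSnapshot {
  connectionId: string;
  incoming: { destroyed: boolean; readable: boolean; writable: boolean; readyState: string };
  outgoing?: { destroyed: boolean; readable: boolean; writable: boolean; readyState: string };
  flags: { connectionClosed: boolean; hasKeepAlive: boolean; isTLS: boolean };
  bytesReceived: number;
  bytesSent: number;
  msSinceLastActivity: number;
  cleanupQueued: boolean;
  listenerCounts: { incoming: number; outgoing: number };
  terminationReason?: string;
}
```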
### Pattern Analysis
The monitor automatically identifies:
- **Zombie connections**: Both sockets destroyed but not cleaned up
- **Half-zombies**: One socket destroyed
- **Stuck connecting**: Outgoing socket stuck in connecting state
- **No outgoing**: Missing outgoing socket
- **Keep-alive stuck**: Keep-alive connections with no recent activity
- **Old connections**: Connections older than 1 hour
- **No data transfer**: Connections with no bytes transferred
- **Listener leaks**: Excessive event listeners
### Common Accumulation Patterns
1. **Connecting State Stuck**
- Outgoing socket shows `connecting: true` indefinitely
- Usually means connection timeout not working
- Check if backend is reachable
2. **Missing Outgoing Socket**
- Connection has no outgoing socket but isn't closed
- May indicate immediate routing issues
- Check error logs during connection setup
3. **Event Listener Accumulation**
- High listener counts (>20) on sockets
- Indicates cleanup not removing all listeners
- Can cause memory leaks
4. **Keep-Alive Zombies**
- Keep-alive connections not timing out
- Check keepAlive timeout settings
- May need more aggressive cleanup
### Next Steps
1. **Run the monitor in production** during accumulation
2. **Share the diagnostic files** from `.nogit/connection-diagnostics/`
3. **Look for patterns** in the captured snapshots
4. **Check specific connection IDs** that accumulate
The diagnostic files will show exactly what state connections are in when accumulation occurs, allowing targeted fixes for the specific issue.
## ✅ FIXED: Stuck Connection Detection (January 2025)
### Additional Root Cause Found
Connections to hanging backends (that accept but never respond) were not being cleaned up because:
- Both sockets remain alive (not destroyed)
- Keep-alive prevents normal timeout
- No data is sent back to the client despite receiving data
- These don't qualify as "zombies" since sockets aren't destroyed
### Fix Implemented
Added stuck connection detection to the periodic inactivity check:
```typescript
// Check for stuck connections: no data sent back to client
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
  const age = now - record.incomingStartTime;
  // If connection is older than 60 seconds and no data sent back, likely stuck
  if (age > 60000) {
    logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
      connectionId,
      remoteIP: record.remoteIP,
      age: plugins.prettyMs(age),
      bytesReceived: record.bytesReceived,
      targetHost: record.targetHost,
      targetPort: record.targetPort,
      component: 'connection-manager'
    });

    // Clean up
    this.cleanupConnection(record, 'stuck_no_response');
  }
}
```
### What This Fixes
- Connections to backends that accept but never respond
- Proxy chains where inner proxy connects to unresponsive services
- Scenarios where keep-alive prevents normal timeout mechanisms
- Connections that receive client data but never send anything back
### Detection Criteria
- Connection has received bytes from client (`bytesReceived > 0`)
- No bytes sent back to client (`bytesSent === 0`)
- Connection is older than 60 seconds
- Both sockets are still alive (not destroyed)
This complements the zombie detection by handling cases where sockets remain technically alive but the connection is effectively dead.
## 🚨 CRITICAL FIX: Cleanup Queue Bug (January 2025)
### Critical Bug Found
The cleanup queue had a severe bug that caused connection accumulation when more than 100 connections needed cleanup:
```typescript
// BUG: This cleared the ENTIRE queue after processing only the first batch!
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
this.cleanupQueue.clear(); // ❌ This discarded all connections beyond the first 100!
```
### Fix Implemented
```typescript
// Now only removes the connections being processed
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
for (const connectionId of toCleanup) {
  this.cleanupQueue.delete(connectionId); // ✅ Only remove what we process
  const record = this.connectionRecords.get(connectionId);
  if (record) {
    this.cleanupConnection(record, record.incomingTerminationReason || 'normal');
  }
}
```
### Impact
- **Before**: If 150 connections needed cleanup, only the first 100 would be processed and the remaining 50 would accumulate forever
- **After**: All connections are properly cleaned up in batches
### Additional Improvements
1. **Faster Inactivity Checks**: Reduced from 30s to 10s intervals
- Zombies and stuck connections are detected 3x faster
- Reduces the window for accumulation
2. **Duplicate Prevention** (sketched after this list): Added check in queueCleanup to prevent processing already-closed connections
- Prevents unnecessary work
- Ensures connections are only cleaned up once
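A sketch of what such a guard could look like inside `queueCleanup` (a method fragment for illustration; the actual implementation may differ):
```typescript
// Sketch: guard in queueCleanup so already-closed connections are never re-queued
// (field names follow the snippets above; the real method has more context)
public queueCleanup(connectionId: string): void {
  const record = this.connectionRecords.get(connectionId);
  // Skip connections that are already fully cleaned up
  if (!record || record.connectionClosed) {
    return;
  }
  this.cleanupQueue.add(connectionId);
  // Arm the batch timer only if one isn't already pending
  if (!this.cleanupTimer) {
    this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), 100);
  }
}
```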
### Summary of All Fixes
1. **Connection Timeout** (already documented) - Prevents accumulation when backends are unreachable
2. **Zombie Detection** - Cleans up connections with destroyed sockets
3. **Stuck Connection Detection** - Cleans up connections to hanging backends
4. **Cleanup Queue Bug** - Ensures ALL connections get cleaned up, not just the first 100
5. **Faster Detection** - Reduced check interval from 30s to 10s
These fixes combined should prevent connection accumulation in all known scenarios.