# Connection Management in SmartProxy
This document describes connection handling, cleanup mechanisms, and known issues in SmartProxy, particularly focusing on proxy chain configurations.
## Connection Accumulation Investigation (January 2025)
### Problem Statement
Connections may accumulate on the outer proxy in proxy chain configurations, despite implemented fixes.
### Historical Context
- **v19.5.12-v19.5.15**: Major connection cleanup improvements
- **v19.5.19+**: PROXY protocol support with WrappedSocket implementation
- **v19.5.20**: Fixed race condition in immediate routing cleanup
### Current Architecture
#### Connection Flow in Proxy Chains
```
Client → Outer Proxy (8001) → Inner Proxy (8002) → Backend (httpbin.org:443)
```
1. **Outer Proxy**:
- Accepts client connection
- Sends PROXY protocol header to inner proxy
- Tracks connection in ConnectionManager
- Immediate routing for non-TLS ports
2. **Inner Proxy**:
- Parses PROXY protocol to get real client IP
- Establishes connection to backend
- Tracks its own connections separately
### Potential Causes of Connection Accumulation
#### 1. Race Condition in Immediate Routing
When a connection is immediately routed (non-TLS ports), there's a timing window:
```typescript
// route-connection-handler.ts, line ~231
this.routeConnection(socket, record, '', undefined);
// Connection is routed before all setup is complete
```
**Issue**: If client disconnects during backend connection setup, cleanup may not trigger properly.
#### 2. Outgoing Socket Assignment Timing
Despite the fix in v19.5.20:
```typescript
// Line 1362 in setupDirectConnection
record.outgoing = targetSocket;
```
There's still a window between socket creation and the `connect` event where cleanup might miss the outgoing socket.
#### 3. Batch Cleanup Delays
ConnectionManager uses queued cleanup (a sketch of the pattern follows the list below):
- Batch size: 100 connections
- Batch interval: 100ms
- Under rapid connection/disconnection, the queue might lag behind
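A minimal sketch of this queued, batched cleanup pattern (class and method names are illustrative, not the actual ConnectionManager API; the batch size and interval mirror the values above):
```typescript
// Illustrative sketch of batched cleanup queueing (not the real ConnectionManager code)
class CleanupQueueSketch {
  private cleanupQueue = new Set<string>();
  private cleanupTimer: NodeJS.Timeout | null = null;
  private readonly cleanupBatchSize = 100; // batch size from the list above
  private readonly cleanupInterval = 100;  // batch interval in ms

  queueCleanup(connectionId: string): void {
    this.cleanupQueue.add(connectionId);
    // Arm the timer only if a batch isn't already pending
    if (!this.cleanupTimer) {
      this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), this.cleanupInterval);
    }
  }

  private processCleanupQueue(): void {
    this.cleanupTimer = null;
    // Take at most one batch; anything beyond it must wait for the next tick
    const batch = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
    for (const id of batch) {
      this.cleanupQueue.delete(id);
      // ... look up the connection record and destroy its sockets here ...
    }
    // Under rapid churn the queue can outgrow the batch, so re-arm the timer
    if (this.cleanupQueue.size > 0) {
      this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), this.cleanupInterval);
    }
  }
}
```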
#### 4. Different Cleanup Paths
Multiple cleanup triggers exist:
- Socket 'close' event
- Socket 'error' event
- Inactivity timeout
- Connection timeout
- Manual cleanup
Not all paths may properly handle proxy chain scenarios.
#### 5. Keep-Alive Connection Handling
Keep-alive connections have special treatment (see the sketch after this list):
- Extended inactivity timeout (6x normal)
- Warning before closure
- May accumulate if backend is unresponsive
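In code, the extended timeout amounts to something like the following (a minimal sketch; only the 6x multiplier is taken from the list above, the field and function names are assumptions):
```typescript
// Sketch: deriving the effective inactivity timeout for keep-alive connections
// (the 6x multiplier comes from the list above; names are assumptions)
const KEEPALIVE_TIMEOUT_MULTIPLIER = 6;

function effectiveInactivityTimeout(record: { hasKeepAlive?: boolean }, baseTimeoutMs: number): number {
  return record.hasKeepAlive ? baseTimeoutMs * KEEPALIVE_TIMEOUT_MULTIPLIER : baseTimeoutMs;
}

// Example: a 5-minute base inactivity timeout becomes 30 minutes for keep-alive connections,
// which is why unresponsive backends can hold these connections open for a long time.
const timeoutMs = effectiveInactivityTimeout({ hasKeepAlive: true }, 300_000); // 1_800_000 ms
```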
### Observed Symptoms
1. **Outer proxy connection count grows over time**
2. **Inner proxy maintains zero or low connection count**
3. **Connections show as closed in logs but remain in tracking**
4. **Memory usage gradually increases**
### Debug Strategies
#### 1. Enhanced Logging
Add connection state logging at key points:
```typescript
// When outgoing socket is created
logger.log('debug', `Outgoing socket created for ${connectionId}`, {
  hasOutgoing: !!record.outgoing,
  outgoingState: record.outgoing?.readyState
});
```
#### 2. Connection State Inspection
Periodically log detailed connection state:
```typescript
for (const [id, record] of connectionManager.getConnections()) {
  console.log({
    id,
    age: Date.now() - record.incomingStartTime,
    incomingDestroyed: record.incoming.destroyed,
    outgoingDestroyed: record.outgoing?.destroyed,
    hasCleanupTimer: !!record.cleanupTimer
  });
}
```
#### 3. Cleanup Verification
Track cleanup completion:
```typescript
// In cleanupConnection
logger.log('debug', `Cleanup completed for ${record.id}`, {
  recordsRemaining: this.connectionRecords.size
});
```
### Recommendations
1. **Immediate Cleanup for Proxy Chains**
- Skip batch queue for proxy chain connections
- Use synchronous cleanup when PROXY protocol is detected
2. **Socket State Validation**
- Check both `destroyed` and `readyState` before cleanup decisions
- Handle 'opening' state sockets explicitly
3. **Timeout Adjustments**
- Shorter timeouts for proxy chain connections
- More aggressive cleanup for connections without data transfer
4. **Connection Limits** (see the sketch after this list)
- Per-route connection limits
- Backpressure when approaching limits
5. **Monitoring**
- Export connection metrics
- Alert on connection count thresholds
- Track connection age distribution
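A sketch of what per-route limits with simple backpressure (recommendation 4) could look like; none of these names are existing SmartProxy options:
```typescript
// Sketch: per-route connection limiting with simple backpressure (all names are assumptions)
interface RouteLimits {
  maxConnections: number; // hard cap per route
  backpressureAt: number; // start pushing back before the cap is hit
}

class RouteConnectionBudget {
  private counts = new Map<string, number>();

  constructor(private limits: Map<string, RouteLimits>) {}

  /** Returns how a new connection on a route should be handled. */
  admit(routeName: string): 'accept' | 'accept-with-backpressure' | 'reject' {
    const limit = this.limits.get(routeName);
    const current = this.counts.get(routeName) ?? 0;
    if (limit && current >= limit.maxConnections) return 'reject';
    this.counts.set(routeName, current + 1);
    return limit && current + 1 >= limit.backpressureAt ? 'accept-with-backpressure' : 'accept';
  }

  /** Call when a connection on the route is cleaned up. */
  release(routeName: string): void {
    const current = this.counts.get(routeName) ?? 0;
    this.counts.set(routeName, Math.max(0, current - 1));
  }
}
```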
### Test Scenarios to Reproduce
1. **Rapid Connect/Disconnect**
```bash
# Create many short-lived connections
for i in {1..1000}; do
  (echo -n | nc localhost 8001) &
done
```
2. **Slow Backend** (a reproduction sketch follows this list)
- Configure inner proxy to connect to unresponsive backend
- Monitor outer proxy connection count
3. **Mixed Traffic**
- Combine TLS and non-TLS connections
- Add keep-alive connections
- Observe accumulation patterns
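For the "Slow Backend" scenario above, an unresponsive backend can be simulated with a plain TCP server that accepts connections but never writes back (the port is arbitrary):
```typescript
// Sketch: a TCP "backend" that accepts connections but never responds,
// useful for reproducing the stuck/accumulation scenarios described in this document.
import * as net from 'net';

const hangingBackend = net.createServer((socket) => {
  // Accept the connection, read (and discard) whatever arrives, but never write back.
  socket.on('data', () => { /* swallow client data */ });
  socket.on('error', () => { /* ignore resets from impatient clients */ });
});

hangingBackend.listen(9999, '127.0.0.1', () => {
  console.log('Hanging backend listening on 127.0.0.1:9999 - point the inner proxy at it');
});
```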
### Future Improvements
1. **Connection Pool Isolation**
- Separate pools for proxy chain vs direct connections
- Different cleanup strategies per pool
2. **Circuit Breaker**
- Detect accumulation and trigger aggressive cleanup
- Temporarily refuse new connections when near limit
3. **Connection State Machine** (sketched after this list)
- Explicit states: CONNECTING, ESTABLISHED, CLOSING, CLOSED
- State transition validation
- Timeout per state
4. **Metrics Collection**
- Connection lifecycle events
- Cleanup success/failure rates
- Time spent in each state
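A sketch of the proposed connection state machine (item 3 above); this is a design illustration, not existing SmartProxy code, and the timeout values are placeholders:
```typescript
// Sketch: explicit connection states with validated transitions and a per-state timeout
type ConnectionState = 'CONNECTING' | 'ESTABLISHED' | 'CLOSING' | 'CLOSED';

// Which transitions are legal from each state
const allowedTransitions: Record<ConnectionState, ConnectionState[]> = {
  CONNECTING: ['ESTABLISHED', 'CLOSING', 'CLOSED'],
  ESTABLISHED: ['CLOSING', 'CLOSED'],
  CLOSING: ['CLOSED'],
  CLOSED: [],
};

// Example per-state timeouts (values are illustrative)
const stateTimeoutMs: Record<ConnectionState, number> = {
  CONNECTING: 30_000,     // connect must finish within 30s
  ESTABLISHED: 3_600_000, // hard cap on connection lifetime
  CLOSING: 10_000,        // teardown should not linger
  CLOSED: 0,
};

function transition(current: ConnectionState, next: ConnectionState): ConnectionState {
  if (!allowedTransitions[current].includes(next)) {
    throw new Error(`Illegal state transition: ${current} -> ${next}`);
  }
  return next;
}
```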
### Root Cause Identified (January 2025)
**The primary issue is on the inner proxy when backends are unreachable:**
When the backend is unreachable (e.g., non-routable IP like 10.255.255.1):
1. The outgoing socket gets stuck in "opening" state indefinitely
2. The `createSocketWithErrorHandler` in socket-utils.ts doesn't implement connection timeout
3. `socket.setTimeout()` only handles inactivity AFTER connection, not during connect phase
4. Connections accumulate because they never transition to error state
5. Socket timeout warnings fire but connections are preserved as keep-alive
**Code Issue:**
```typescript
// socket-utils.ts line 275
if (timeout) {
  socket.setTimeout(timeout); // This only handles inactivity, not connection!
}
```
**Required Fix:**
1. Add `connectionTimeout` to ISmartProxyOptions interface:
```typescript
// In interfaces.ts
connectionTimeout?: number; // Timeout for establishing connection (ms), default: 30000 (30s)
```
2. Update `createSocketWithErrorHandler` in socket-utils.ts:
```typescript
export function createSocketWithErrorHandler(options: SafeSocketOptions): plugins.net.Socket {
  const { port, host, onError, onConnect, timeout } = options;
  const socket = new plugins.net.Socket();
  let connected = false;
  let connectionTimeout: NodeJS.Timeout | null = null;

  socket.on('error', (error) => {
    if (connectionTimeout) {
      clearTimeout(connectionTimeout);
      connectionTimeout = null;
    }
    if (onError) onError(error);
  });

  socket.on('connect', () => {
    connected = true;
    if (connectionTimeout) {
      clearTimeout(connectionTimeout);
      connectionTimeout = null;
    }
    if (timeout) socket.setTimeout(timeout); // Set inactivity timeout
    if (onConnect) onConnect();
  });

  // Implement connection establishment timeout
  if (timeout) {
    connectionTimeout = setTimeout(() => {
      if (!connected && !socket.destroyed) {
        const error = new Error(`Connection timeout after ${timeout}ms to ${host}:${port}`);
        (error as any).code = 'ETIMEDOUT';
        socket.destroy();
        if (onError) onError(error);
      }
    }, timeout);
  }

  socket.connect(port, host);
  return socket;
}
```
3. Pass connectionTimeout in route-connection-handler.ts:
```typescript
const targetSocket = createSocketWithErrorHandler({
  port: finalTargetPort,
  host: finalTargetHost,
  timeout: this.settings.connectionTimeout || 30000, // Connection timeout
  onError: (error) => { /* existing */ },
  onConnect: async () => { /* existing */ }
});
```
### Investigation Results (January 2025)
Based on extensive testing with debug scripts:
1. **Normal Operation**: In controlled tests, connections are properly cleaned up:
- Immediate routing cleanup handler properly destroys outgoing connections
- Both outer and inner proxies maintain 0 connections after clients disconnect
- Keep-alive connections are tracked and cleaned up correctly
2. **Potential Edge Cases Not Covered by Tests**:
- **HTTP/2 Connections**: May have different lifecycle than HTTP/1.1
- **WebSocket Connections**: Long-lived upgrade connections might persist
- **Partial TLS Handshakes**: Connections that start TLS but don't complete
- **PROXY Protocol Parse Failures**: Malformed headers from untrusted sources
- **Connection Pool Reuse**: HttpProxy component may maintain its own pools
3. **Timing-Sensitive Scenarios**:
- Client disconnects exactly when `record.outgoing` is being assigned
- Backend connects but immediately RSTs
- Proxy chain where middle proxy restarts
- Multiple rapid reconnects with same source IP/port
4. **Configuration-Specific Issues**:
- Mixed `sendProxyProtocol` settings in chain
- Different `keepAlive` settings between proxies
- Mismatched timeout values
- Routes with `forwardingEngine: 'nftables'`
### Additional Debug Points
Add these debug logs to identify the specific scenario:
```typescript
// In route-connection-handler.ts setupDirectConnection
logger.log('debug', `Setting outgoing socket for ${connectionId}`, {
  timestamp: Date.now(),
  hasOutgoing: !!record.outgoing,
  socketState: targetSocket.readyState
});

// In connection-manager.ts cleanupConnection
logger.log('debug', `Cleanup attempt for ${record.id}`, {
  alreadyClosed: record.connectionClosed,
  hasIncoming: !!record.incoming,
  hasOutgoing: !!record.outgoing,
  incomingDestroyed: record.incoming?.destroyed,
  outgoingDestroyed: record.outgoing?.destroyed
});
```
### Workarounds
Until root cause is identified:
1. **Periodic Force Cleanup**:
```typescript
setInterval(() => {
  const connections = connectionManager.getConnections();
  for (const [id, record] of connections) {
    if (record.incoming?.destroyed && !record.connectionClosed) {
      connectionManager.cleanupConnection(record, 'force_cleanup');
    }
  }
}, 60000); // Every minute
```
2. **Connection Age Limit**:
```typescript
// Add max connection age check
const maxAge = 3600000; // 1 hour
if (Date.now() - record.incomingStartTime > maxAge) {
  connectionManager.cleanupConnection(record, 'max_age');
}
```
3. **Aggressive Timeout Settings**:
```typescript
{
  socketTimeout: 60000,            // 1 minute
  inactivityTimeout: 300000,       // 5 minutes
  connectionCleanupInterval: 30000 // 30 seconds
}
```
### Related Files
- `/ts/proxies/smart-proxy/route-connection-handler.ts` - Main connection handling
- `/ts/proxies/smart-proxy/connection-manager.ts` - Connection tracking and cleanup
- `/ts/core/utils/socket-utils.ts` - Socket cleanup utilities
- `/test/test.proxy-chain-cleanup.node.ts` - Test for connection cleanup
- `/test/test.proxy-chaining-accumulation.node.ts` - Test for accumulation prevention
- `/.nogit/debug/connection-accumulation-debug.ts` - Debug script for connection states
- `/.nogit/debug/connection-accumulation-keepalive.ts` - Keep-alive specific tests
- `/.nogit/debug/connection-accumulation-http.ts` - HTTP traffic through proxy chains
### Summary
**Issue Identified**: Connection accumulation occurs on the **inner proxy** (not outer) when backends are unreachable.
**Root Cause**: The `createSocketWithErrorHandler` function in socket-utils.ts doesn't implement connection establishment timeout. It only sets `socket.setTimeout()` which handles inactivity AFTER connection is established, not during the connect phase.
**Impact**: When connecting to unreachable IPs (e.g., 10.255.255.1), outgoing sockets remain in "opening" state indefinitely, causing connections to accumulate.
**Fix Required**:
1. Add `connectionTimeout` setting to ISmartProxyOptions
2. Implement proper connection timeout in `createSocketWithErrorHandler`
3. Pass the timeout value from route-connection-handler
**Workaround Until Fixed**: Configure shorter socket timeouts and use the periodic force cleanup suggested above.
The connection cleanup mechanisms have been significantly improved in v19.5.20:
1. Race condition fixed by setting `record.outgoing` before connecting
2. Immediate routing cleanup handler always destroys outgoing connections
3. Tests confirm no accumulation in standard scenarios with reachable backends
However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect.
### Outer Proxy Sudden Accumulation After Hours
**User Report**: "The counter goes up suddenly after some hours on the outer proxy"
**Investigation Findings**:
1. **Cleanup Queue Mechanism**:
- Connections are cleaned up in batches of 100 via a queue
- If the cleanup timer gets stuck or cleared without restart, connections accumulate
- The timer is set with `setTimeout` and could be affected by event loop blocking
2. **Potential Causes for Sudden Spikes**:
a) **Cleanup Timer Failure**:
```typescript
// In ConnectionManager, if this timer gets cleared but not restarted:
this.cleanupTimer = this.setTimeout(() => {
  this.processCleanupQueue();
}, 100);
```
b) **Memory Pressure**:
- After hours of operation, memory fragmentation or pressure could cause delays
- Garbage collection pauses might interfere with timer execution
c) **Event Listener Accumulation**:
- Socket event listeners might accumulate over time
- Server 'connection' event handlers are particularly important
d) **Keep-Alive Connection Cascades**:
- When many keep-alive connections timeout simultaneously
- Outer proxy has different timeout than inner proxy
- Mass disconnection events can overwhelm cleanup queue
e) **HttpProxy Component Issues**:
- If using `useHttpProxy`, the HttpProxy bridge might maintain connection pools
- These pools might not be properly cleaned after hours
3. **Why "Sudden" After Hours**:
- Not a gradual leak but triggered by specific conditions
- Likely related to periodic events or thresholds:
- Inactivity check runs every 30 seconds
- Keep-alive connections have extended timeouts (6x normal)
- Parity check has 30-minute timeout for half-closed connections
4. **Reproduction Scenarios**:
- Mass client disconnection/reconnection (network blip)
- Keep-alive timeout cascade when inner proxy times out first
- Cleanup timer getting stuck during high load
- Memory pressure causing event loop delays
### Additional Monitoring Recommendations
1. **Add Cleanup Queue Monitoring**:
```typescript
setInterval(() => {
  const cm = proxy.connectionManager;
  if (cm.cleanupQueue.size > 100 && !cm.cleanupTimer) {
    logger.error('Cleanup queue stuck!', {
      queueSize: cm.cleanupQueue.size,
      hasTimer: !!cm.cleanupTimer
    });
  }
}, 60000);
```
2. **Track Timer Health** (see the sketch after this list):
- Monitor if cleanup timer is running
- Check for event loop blocking
- Log when batch processing takes too long
3. **Memory Monitoring**:
- Track heap usage over time
- Monitor for memory leaks in long-running processes
- Force periodic garbage collection if needed
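A sketch covering items 2 and 3: detecting event loop lag (which would delay the cleanup timer) and logging heap usage. The thresholds are arbitrary:
```typescript
// Sketch: detect event loop blocking and log heap usage (thresholds are illustrative)
const EXPECTED_INTERVAL_MS = 1000;
const LAG_WARN_MS = 200; // warn if the timer fires this much late

let last = Date.now();
setInterval(() => {
  const now = Date.now();
  const lag = now - last - EXPECTED_INTERVAL_MS;
  last = now;
  if (lag > LAG_WARN_MS) {
    console.warn(`Event loop lag of ${lag}ms detected - cleanup timers may be delayed`);
  }

  // Track heap usage over time to spot slow leaks in long-running processes
  const mem = process.memoryUsage();
  console.log(`heapUsed=${Math.round(mem.heapUsed / 1024 / 1024)}MB rss=${Math.round(mem.rss / 1024 / 1024)}MB`);
}, EXPECTED_INTERVAL_MS);
```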
### Immediate Mitigations
1. **Restart Cleanup Timer**:
```typescript
// Emergency cleanup timer restart
if (!cm.cleanupTimer && cm.cleanupQueue.size > 0) {
  cm.cleanupTimer = setTimeout(() => {
    cm.processCleanupQueue();
  }, 100);
}
```
2. **Force Periodic Cleanup**:
```typescript
setInterval(() => {
  const cm = connectionManager;
  if (cm.getConnectionCount() > threshold) {
    cm.performOptimizedInactivityCheck();
    // Force process cleanup queue
    cm.processCleanupQueue();
  }
}, 300000); // Every 5 minutes
```
3. **Connection Age Limits**:
- Set maximum connection lifetime
- Force close connections older than threshold
- More aggressive cleanup for proxy chains
## ✅ FIXED: Zombie Connection Detection (January 2025)
### Root Cause Identified
"Zombie connections" occur when sockets are destroyed without triggering their close/error event handlers. This causes connections to remain tracked with both sockets destroyed but `connectionClosed=false` . This is particularly problematic in proxy chains where the inner proxy might close connections in ways that don't trigger proper events on the outer proxy.
### Fix Implemented
Added zombie detection to the periodic inactivity check in ConnectionManager:
```typescript
// In performOptimizedInactivityCheck()
// Check ALL connections for zombie state
for (const [connectionId, record] of this.connectionRecords) {
  if (!record.connectionClosed) {
    const incomingDestroyed = record.incoming?.destroyed || false;
    const outgoingDestroyed = record.outgoing?.destroyed || false;

    // Check for zombie connections: both sockets destroyed but not cleaned up
    if (incomingDestroyed && outgoingDestroyed) {
      logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, {
        connectionId,
        remoteIP: record.remoteIP,
        age: plugins.prettyMs(now - record.incomingStartTime),
        component: 'connection-manager'
      });

      // Clean up immediately
      this.cleanupConnection(record, 'zombie_cleanup');
      continue;
    }

    // Check for half-zombie: one socket destroyed
    if (incomingDestroyed || outgoingDestroyed) {
      const age = now - record.incomingStartTime;
      // Give it 30 seconds grace period for normal cleanup
      if (age > 30000) {
        logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, {
          connectionId,
          remoteIP: record.remoteIP,
          age: plugins.prettyMs(age),
          incomingDestroyed,
          outgoingDestroyed,
          component: 'connection-manager'
        });

        // Clean up
        this.cleanupConnection(record, 'half_zombie_cleanup');
      }
    }
  }
}
```
### How It Works
1. **Full Zombie Detection**: Detects when both incoming and outgoing sockets are destroyed but the connection hasn't been cleaned up
2. **Half-Zombie Detection**: Detects when only one socket is destroyed, with a 30-second grace period for normal cleanup to occur
3. **Automatic Cleanup**: Immediately cleans up zombie connections when detected
4. **Runs Periodically**: Integrated into the existing inactivity check that runs every 30 seconds
### Why This Fixes the Outer Proxy Accumulation
- When inner proxy closes connections abruptly (e.g., due to backend failure), the outer proxy's outgoing socket might be destroyed without firing close/error events
- These become zombie connections that previously accumulated indefinitely
- Now they are detected and cleaned up within 30 seconds
### Test Results
Debug scripts confirmed:
- Zombie connections can be created when sockets are destroyed directly without events
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled
This fix addresses the user's request that "connections that are closed on the inner proxy always also close on the outer proxy".
## 🔍 Production Diagnostics (January 2025)
Since the zombie detection fix didn't fully resolve the issue, use the ProductionConnectionMonitor to diagnose the actual problem:
### How to Use the Production Monitor
1. **Add to your proxy startup script**:
```typescript
import ProductionConnectionMonitor from './production-connection-monitor.js';
// After proxy.start()
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds
// Monitor will automatically capture diagnostics when:
// - Connections exceed threshold (default: 50)
// - Sudden spike occurs (default: +20 connections)
```
2. **Diagnostics are saved to**: `.nogit/connection-diagnostics/`
3. **Force capture anytime**: `monitor.forceCaptureNow()`
### What the Monitor Captures
For each connection (a possible snapshot shape is sketched after this list):
- Socket states (destroyed, readable, writable, readyState)
- Connection flags (closed, keepAlive, TLS status)
- Data transfer statistics
- Time since last activity
- Cleanup queue status
- Event listener counts
- Termination reasons
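One possible shape for such a per-connection snapshot (an assumed structure for illustration; the actual ProductionConnectionMonitor output may differ):
```typescript
// Sketch: a per-connection diagnostic snapshot covering the fields listed above
// (field names are assumptions, not the actual ProductionConnectionMonitor output)
interface ConnectionSnapshot {
  connectionId: string;
  incoming: { destroyed: boolean; readable: boolean; writable: boolean; readyState: string };
  outgoing?: { destroyed: boolean; readable: boolean; writable: boolean; readyState: string };
  flags: { connectionClosed: boolean; hasKeepAlive: boolean; isTLS: boolean };
  bytesReceived: number;
  bytesSent: number;
  msSinceLastActivity: number;
  cleanupQueued: boolean;
  listenerCounts: { incoming: number; outgoing: number };
  terminationReason?: string;
}
```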
### Pattern Analysis
The monitor automatically identifies:
- **Zombie connections**: Both sockets destroyed but not cleaned up
- **Half-zombies**: One socket destroyed
- **Stuck connecting**: Outgoing socket stuck in connecting state
- **No outgoing**: Missing outgoing socket
- **Keep-alive stuck**: Keep-alive connections with no recent activity
- **Old connections**: Connections older than 1 hour
- **No data transfer**: Connections with no bytes transferred
- **Listener leaks**: Excessive event listeners
### Common Accumulation Patterns
1. **Connecting State Stuck**
- Outgoing socket shows `connecting: true` indefinitely
- Usually means connection timeout not working
- Check if backend is reachable
2. **Missing Outgoing Socket**
- Connection has no outgoing socket but isn't closed
- May indicate immediate routing issues
- Check error logs during connection setup
3. **Event Listener Accumulation**
- High listener counts (>20) on sockets
- Indicates cleanup not removing all listeners
- Can cause memory leaks
4. **Keep-Alive Zombies**
- Keep-alive connections not timing out
- Check keepAlive timeout settings
- May need more aggressive cleanup
### Next Steps
1. **Run the monitor in production** during accumulation
2. **Share the diagnostic files** from `.nogit/connection-diagnostics/`
3. **Look for patterns** in the captured snapshots
4. **Check specific connection IDs** that accumulate
The diagnostic files will show exactly what state connections are in when accumulation occurs, allowing targeted fixes for the specific issue.
## ✅ FIXED: Stuck Connection Detection (January 2025)
### Additional Root Cause Found
Connections to hanging backends (that accept but never respond) were not being cleaned up because:
- Both sockets remain alive (not destroyed)
- Keep-alive prevents normal timeout
- No data is sent back to the client despite receiving data
- These don't qualify as "zombies" since sockets aren't destroyed
### Fix Implemented
Added stuck connection detection to the periodic inactivity check:
```typescript
// Check for stuck connections: no data sent back to client
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
  const age = now - record.incomingStartTime;
  // If connection is older than 60 seconds and no data sent back, likely stuck
  if (age > 60000) {
    logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
      connectionId,
      remoteIP: record.remoteIP,
      age: plugins.prettyMs(age),
      bytesReceived: record.bytesReceived,
      targetHost: record.targetHost,
      targetPort: record.targetPort,
      component: 'connection-manager'
    });

    // Clean up
    this.cleanupConnection(record, 'stuck_no_response');
  }
}
```
### What This Fixes
- Connections to backends that accept but never respond
- Proxy chains where inner proxy connects to unresponsive services
- Scenarios where keep-alive prevents normal timeout mechanisms
- Connections that receive client data but never send anything back
### Detection Criteria
- Connection has received bytes from client (`bytesReceived > 0`)
- No bytes sent back to client (`bytesSent === 0`)
- Connection is older than 60 seconds
- Both sockets are still alive (not destroyed)
This complements the zombie detection by handling cases where sockets remain technically alive but the connection is effectively dead.
## 🚨 CRITICAL FIX: Cleanup Queue Bug (January 2025)
### Critical Bug Found
The cleanup queue had a severe bug that caused connection accumulation when more than 100 connections needed cleanup:
```typescript
// BUG: This cleared the ENTIRE queue after processing only the first batch!
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
this.cleanupQueue.clear(); // ❌ This discarded all connections beyond the first 100!
```
### Fix Implemented
```typescript
// Now only removes the connections being processed
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
for (const connectionId of toCleanup) {
  this.cleanupQueue.delete(connectionId); // ✅ Only remove what we process
  const record = this.connectionRecords.get(connectionId);
  if (record) {
    this.cleanupConnection(record, record.incomingTerminationReason || 'normal');
  }
}
```
### Impact
- **Before**: If 150 connections needed cleanup, only the first 100 would be processed and the remaining 50 would accumulate forever
- **After**: All connections are properly cleaned up in batches
### Additional Improvements
1. **Faster Inactivity Checks**: Reduced from 30s to 10s intervals
- Zombies and stuck connections are detected 3x faster
- Reduces the window for accumulation
2. **Duplicate Prevention** (sketched after this list): Added check in queueCleanup to prevent processing already-closed connections
- Prevents unnecessary work
- Ensures connections are only cleaned up once
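A sketch of what such a guard could look like inside `queueCleanup` (a method fragment for illustration; the actual implementation may differ):
```typescript
// Sketch: guard in queueCleanup so already-closed connections are never re-queued
// (field names follow the snippets above; the real method has more context)
public queueCleanup(connectionId: string): void {
  const record = this.connectionRecords.get(connectionId);
  // Skip connections that are already fully cleaned up
  if (!record || record.connectionClosed) {
    return;
  }
  this.cleanupQueue.add(connectionId);
  // Arm the batch timer only if one isn't already pending
  if (!this.cleanupTimer) {
    this.cleanupTimer = setTimeout(() => this.processCleanupQueue(), 100);
  }
}
```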
### Summary of All Fixes
1. **Connection Timeout** (already documented) - Prevents accumulation when backends are unreachable
2. **Zombie Detection** - Cleans up connections with destroyed sockets
3. **Stuck Connection Detection** - Cleans up connections to hanging backends
4. **Cleanup Queue Bug** - Ensures ALL connections get cleaned up, not just the first 100
5. **Faster Detection** - Reduced check interval from 30s to 10s
These fixes combined should prevent connection accumulation in all known scenarios.