# Connection Management in SmartProxy

This document describes connection handling, cleanup mechanisms, and known issues in SmartProxy, with particular focus on proxy chain configurations.

## Connection Accumulation Investigation (January 2025)

### Problem Statement

Connections may accumulate on the outer proxy in proxy chain configurations, despite previously implemented fixes.

### Historical Context

- **v19.5.12-v19.5.15**: Major connection cleanup improvements
- **v19.5.19+**: PROXY protocol support with WrappedSocket implementation
- **v19.5.20**: Fixed race condition in immediate routing cleanup

### Current Architecture

#### Connection Flow in Proxy Chains

```
Client → Outer Proxy (8001) → Inner Proxy (8002) → Backend (httpbin.org:443)
```

1. **Outer Proxy**:
   - Accepts client connection
   - Sends PROXY protocol header to inner proxy
   - Tracks connection in ConnectionManager
   - Immediate routing for non-TLS ports

2. **Inner Proxy**:
   - Parses PROXY protocol to get the real client IP
   - Establishes connection to backend
   - Tracks its own connections separately

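For reference, the two-hop chain above can be stood up in a few lines. This is a minimal sketch, not verbatim SmartProxy API: the route shape (`match`/`action`) and option placement follow this document's terminology (`sendProxyProtocol` appears elsewhere in this doc), but the exact schema may differ between versions.

```typescript
import { SmartProxy } from '@push.rocks/smartproxy';

// Outer proxy: accepts clients on 8001, forwards to the inner proxy,
// and announces the real client IP via a PROXY protocol header.
const outer = new SmartProxy({
  routes: [{
    match: { ports: 8001 },
    action: { type: 'forward', target: { host: 'localhost', port: 8002 }, sendProxyProtocol: true },
  }],
});

// Inner proxy: parses the PROXY protocol header, then connects to the backend.
const inner = new SmartProxy({
  routes: [{
    match: { ports: 8002 },
    action: { type: 'forward', target: { host: 'httpbin.org', port: 443 } },
  }],
});

await inner.start();
await outer.start();
```
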
### Potential Causes of Connection Accumulation

#### 1. Race Condition in Immediate Routing

When a connection is immediately routed (non-TLS ports), there is a timing window:

```typescript
// route-connection-handler.ts, line ~231
this.routeConnection(socket, record, '', undefined);
// Connection is routed before all setup is complete
```

**Issue**: If the client disconnects during backend connection setup, cleanup may not trigger properly.

#### 2. Outgoing Socket Assignment Timing

Despite the fix in v19.5.20:

```typescript
// Line 1362 in setupDirectConnection
record.outgoing = targetSocket;
```

There's still a window between socket creation and the `connect` event where cleanup might miss the outgoing socket.

#### 3. Batch Cleanup Delays

ConnectionManager uses queued cleanup:

- Batch size: 100 connections
- Batch interval: 100ms
- Under rapid connection/disconnection churn, the queue can lag behind (see the sketch below)

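A quick back-of-the-envelope check makes the lag concrete, using the batch parameters above:

```typescript
// With batches of 100 every 100ms, the queue drains at most 1,000
// cleanups per second; sustained churn above that grows the backlog.
const batchSize = 100;
const batchIntervalMs = 100;
const drainPerSecond = batchSize * (1000 / batchIntervalMs); // 1,000/s

function backlogAfter(seconds: number, closuresPerSecond: number): number {
  return Math.max(0, (closuresPerSecond - drainPerSecond) * seconds);
}

console.log(backlogAfter(60, 1500)); // 30000 cleanups still queued after one minute
```
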
#### 4. Different Cleanup Paths

Multiple cleanup triggers exist:

- Socket 'close' event
- Socket 'error' event
- Inactivity timeout
- Connection timeout
- Manual cleanup

Not all paths may properly handle proxy chain scenarios (see the idempotent-cleanup sketch below).

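One common way to make divergent trigger paths safe is to funnel them into a single idempotent routine. The following sketch illustrates the idea; it is not ConnectionManager's actual implementation:

```typescript
import * as net from 'net';

interface TrackedConnection {
  connectionClosed: boolean;
  incoming?: net.Socket;
  outgoing?: net.Socket;
}

// Every trigger ('close', 'error', timeouts, manual) calls this one
// function; the flag guarantees the record is processed exactly once.
function cleanupOnce(record: TrackedConnection, reason: string): void {
  if (record.connectionClosed) return; // another path already handled it
  record.connectionClosed = true;
  record.incoming?.destroy();
  record.outgoing?.destroy();
  console.log(`connection cleaned up (${reason})`);
}
```
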
#### 5. Keep-Alive Connection Handling

Keep-alive connections receive special treatment:

- Extended inactivity timeout (6x normal)
- Warning before closure
- May accumulate if the backend is unresponsive

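The 6x extension can be expressed as a small helper. This is a sketch: the multiplier comes from this document, the names are illustrative:

```typescript
// Keep-alive connections get six times the normal inactivity budget,
// which is why an unresponsive backend can hold them open for so long.
const KEEPALIVE_MULTIPLIER = 6;

function effectiveInactivityTimeout(hasKeepAlive: boolean, baseMs: number): number {
  return hasKeepAlive ? baseMs * KEEPALIVE_MULTIPLIER : baseMs;
}

console.log(effectiveInactivityTimeout(true, 300_000)); // 1_800_000 ms = 30 minutes
```
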
### Observed Symptoms

1. **Outer proxy connection count grows over time**
2. **Inner proxy maintains zero or low connection count**
3. **Connections show as closed in logs but remain in tracking**
4. **Memory usage gradually increases**

### Debug Strategies

#### 1. Enhanced Logging

Add connection state logging at key points:

```typescript
// When outgoing socket is created
logger.log('debug', `Outgoing socket created for ${connectionId}`, {
  hasOutgoing: !!record.outgoing,
  outgoingState: record.outgoing?.readyState
});
```

#### 2. Connection State Inspection

Periodically log detailed connection state:

```typescript
for (const [id, record] of connectionManager.getConnections()) {
  console.log({
    id,
    age: Date.now() - record.incomingStartTime,
    incomingDestroyed: record.incoming.destroyed,
    outgoingDestroyed: record.outgoing?.destroyed,
    hasCleanupTimer: !!record.cleanupTimer
  });
}
```

#### 3. Cleanup Verification

Track cleanup completion:

```typescript
// In cleanupConnection
logger.log('debug', `Cleanup completed for ${record.id}`, {
  recordsRemaining: this.connectionRecords.size
});
```

### Recommendations

1. **Immediate Cleanup for Proxy Chains**
   - Skip the batch queue for proxy chain connections
   - Use synchronous cleanup when PROXY protocol is detected

2. **Socket State Validation**
   - Check both `destroyed` and `readyState` before cleanup decisions
   - Handle 'opening' state sockets explicitly

3. **Timeout Adjustments**
   - Shorter timeouts for proxy chain connections
   - More aggressive cleanup for connections without data transfer

4. **Connection Limits** (see the sketch after this list)
   - Per-route connection limits
   - Backpressure when approaching limits

5. **Monitoring**
   - Export connection metrics
   - Alert on connection count thresholds
   - Track connection age distribution

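For recommendation 4, a per-route counter with rejection at the cap could look like this sketch (the route keys and limit value are illustrative):

```typescript
// Track live connections per route and refuse new ones at the cap,
// giving the proxy a natural backpressure point.
const perRouteCounts = new Map<string, number>();
const PER_ROUTE_LIMIT = 1000;

function tryAcquire(routeId: string): boolean {
  const current = perRouteCounts.get(routeId) ?? 0;
  if (current >= PER_ROUTE_LIMIT) return false; // reject: at capacity
  perRouteCounts.set(routeId, current + 1);
  return true;
}

function release(routeId: string): void {
  perRouteCounts.set(routeId, Math.max(0, (perRouteCounts.get(routeId) ?? 0) - 1));
}
```
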
### Test Scenarios to Reproduce

1. **Rapid Connect/Disconnect**
   ```bash
   # Create many short-lived connections
   for i in {1..1000}; do
     (echo -n | nc localhost 8001) &
   done
   ```

2. **Slow Backend** (see the sketch after this list)
   - Configure the inner proxy to connect to an unresponsive backend
   - Monitor the outer proxy connection count

3. **Mixed Traffic**
   - Combine TLS and non-TLS connections
   - Add keep-alive connections
   - Observe accumulation patterns

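For scenario 2, an unresponsive backend is easy to fake: a TCP server that accepts connections and then does nothing (a sketch; port 9999 is arbitrary):

```typescript
import * as net from 'net';

// Accepts connections but never writes and never closes: from the
// proxy's perspective the upstream is reachable but permanently stalled.
const blackhole = net.createServer(() => {
  /* accept and ignore */
});

blackhole.listen(9999, () => {
  console.log('blackhole backend listening on :9999');
});
```
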
### Future Improvements

1. **Connection Pool Isolation**
   - Separate pools for proxy chain vs direct connections
   - Different cleanup strategies per pool

2. **Circuit Breaker**
   - Detect accumulation and trigger aggressive cleanup
   - Temporarily refuse new connections when near the limit

3. **Connection State Machine** (see the sketch after this list)
   - Explicit states: CONNECTING, ESTABLISHED, CLOSING, CLOSED
   - State transition validation
   - Timeout per state

4. **Metrics Collection**
   - Connection lifecycle events
   - Cleanup success/failure rates
   - Time spent in each state

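The state machine in item 3 might be sketched as follows (state names from the list above; the timeout values are illustrative):

```typescript
type ConnState = 'CONNECTING' | 'ESTABLISHED' | 'CLOSING' | 'CLOSED';

// Legal transitions: anything else indicates a lifecycle bug.
const allowedTransitions: Record<ConnState, ConnState[]> = {
  CONNECTING: ['ESTABLISHED', 'CLOSING', 'CLOSED'],
  ESTABLISHED: ['CLOSING', 'CLOSED'],
  CLOSING: ['CLOSED'],
  CLOSED: [],
};

// Per-state timeout budget, e.g. a connect that exceeds 30s is failed.
const stateTimeoutMs: Record<ConnState, number> = {
  CONNECTING: 30_000,
  ESTABLISHED: 600_000,
  CLOSING: 10_000,
  CLOSED: 0,
};

function transition(from: ConnState, to: ConnState): ConnState {
  if (!allowedTransitions[from].includes(to)) {
    throw new Error(`Invalid connection state transition: ${from} -> ${to}`);
  }
  return to;
}
```
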
### Root Cause Identified (January 2025)

**The primary issue is on the inner proxy when backends are unreachable.**

When the backend is unreachable (e.g., a non-routable IP like 10.255.255.1):

1. The outgoing socket gets stuck in the "opening" state indefinitely
2. `createSocketWithErrorHandler` in socket-utils.ts does not implement a connection timeout
3. `socket.setTimeout()` only handles inactivity AFTER the connection is established, not during the connect phase
4. Connections accumulate because they never transition to an error state
5. Socket timeout warnings fire, but the connections are preserved as keep-alive

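The stuck-connect behavior can be reproduced in isolation with plain Node sockets (a sketch; 10.255.255.1 is the non-routable address used above):

```typescript
import * as net from 'net';

const socket = new net.Socket();
socket.on('connect', () => console.log('connected'));            // never fires
socket.on('error', (err) => console.log('error:', err.message)); // may take minutes (OS-level timeout), if it fires at all
socket.connect(443, '10.255.255.1');

// Long after a reasonable connect budget, the socket still reports "opening".
setTimeout(() => {
  console.log('readyState after 10s:', socket.readyState); // 'opening'
  socket.destroy();
}, 10_000);
```
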
**Code Issue:**

```typescript
// socket-utils.ts line 275
if (timeout) {
  socket.setTimeout(timeout); // This only handles inactivity, not connection!
}
```

**Required Fix:**

1. Add `connectionTimeout` to the ISmartProxyOptions interface:

```typescript
// In interfaces.ts
connectionTimeout?: number; // Timeout for establishing connection (ms), default: 30000 (30s)
```

2. Update `createSocketWithErrorHandler` in socket-utils.ts:

```typescript
export function createSocketWithErrorHandler(options: SafeSocketOptions): plugins.net.Socket {
  const { port, host, onError, onConnect, timeout } = options;

  const socket = new plugins.net.Socket();
  let connected = false;
  let connectionTimeout: NodeJS.Timeout | null = null;

  socket.on('error', (error) => {
    if (connectionTimeout) {
      clearTimeout(connectionTimeout);
      connectionTimeout = null;
    }
    if (onError) onError(error);
  });

  socket.on('connect', () => {
    connected = true;
    if (connectionTimeout) {
      clearTimeout(connectionTimeout);
      connectionTimeout = null;
    }
    if (timeout) socket.setTimeout(timeout); // Set inactivity timeout
    if (onConnect) onConnect();
  });

  // Implement connection establishment timeout
  if (timeout) {
    connectionTimeout = setTimeout(() => {
      if (!connected && !socket.destroyed) {
        const error = new Error(`Connection timeout after ${timeout}ms to ${host}:${port}`);
        (error as any).code = 'ETIMEDOUT';
        socket.destroy();
        if (onError) onError(error);
      }
    }, timeout);
  }

  socket.connect(port, host);
  return socket;
}
```

3. Pass `connectionTimeout` in route-connection-handler.ts:

```typescript
const targetSocket = createSocketWithErrorHandler({
  port: finalTargetPort,
  host: finalTargetHost,
  timeout: this.settings.connectionTimeout || 30000, // Connection timeout
  onError: (error) => { /* existing */ },
  onConnect: async () => { /* existing */ }
});
```

### Investigation Results (January 2025)

Based on extensive testing with debug scripts:

1. **Normal Operation**: In controlled tests, connections are properly cleaned up:
   - The immediate routing cleanup handler properly destroys outgoing connections
   - Both outer and inner proxies maintain 0 connections after clients disconnect
   - Keep-alive connections are tracked and cleaned up correctly

2. **Potential Edge Cases Not Covered by Tests**:
   - **HTTP/2 Connections**: May have a different lifecycle than HTTP/1.1
   - **WebSocket Connections**: Long-lived upgrade connections might persist
   - **Partial TLS Handshakes**: Connections that start TLS but don't complete
   - **PROXY Protocol Parse Failures**: Malformed headers from untrusted sources
   - **Connection Pool Reuse**: The HttpProxy component may maintain its own pools

3. **Timing-Sensitive Scenarios**:
   - Client disconnects exactly when `record.outgoing` is being assigned
   - Backend connects but immediately sends RST
   - Proxy chain where the middle proxy restarts
   - Multiple rapid reconnects with the same source IP/port

4. **Configuration-Specific Issues**:
   - Mixed `sendProxyProtocol` settings in the chain
   - Different `keepAlive` settings between proxies
   - Mismatched timeout values
   - Routes with `forwardingEngine: 'nftables'`

### Additional Debug Points

Add these debug logs to identify the specific scenario:

```typescript
// In route-connection-handler.ts setupDirectConnection
logger.log('debug', `Setting outgoing socket for ${connectionId}`, {
  timestamp: Date.now(),
  hasOutgoing: !!record.outgoing,
  socketState: targetSocket.readyState
});

// In connection-manager.ts cleanupConnection
logger.log('debug', `Cleanup attempt for ${record.id}`, {
  alreadyClosed: record.connectionClosed,
  hasIncoming: !!record.incoming,
  hasOutgoing: !!record.outgoing,
  incomingDestroyed: record.incoming?.destroyed,
  outgoingDestroyed: record.outgoing?.destroyed
});
```

### Workarounds

Until the fix is in place:

1. **Periodic Force Cleanup**:
   ```typescript
   setInterval(() => {
     const connections = connectionManager.getConnections();
     for (const [id, record] of connections) {
       if (record.incoming?.destroyed && !record.connectionClosed) {
         connectionManager.cleanupConnection(record, 'force_cleanup');
       }
     }
   }, 60000); // Every minute
   ```

2. **Connection Age Limit**:
   ```typescript
   // Enforce a maximum connection age, checked periodically
   const maxAge = 3600000; // 1 hour
   setInterval(() => {
     for (const [id, record] of connectionManager.getConnections()) {
       if (Date.now() - record.incomingStartTime > maxAge) {
         connectionManager.cleanupConnection(record, 'max_age');
       }
     }
   }, 60000);
   ```

3. **Aggressive Timeout Settings**:
   ```typescript
   {
     socketTimeout: 60000,            // 1 minute
     inactivityTimeout: 300000,       // 5 minutes
     connectionCleanupInterval: 30000 // 30 seconds
   }
   ```

### Related Files

- `/ts/proxies/smart-proxy/route-connection-handler.ts` - Main connection handling
- `/ts/proxies/smart-proxy/connection-manager.ts` - Connection tracking and cleanup
- `/ts/core/utils/socket-utils.ts` - Socket cleanup utilities
- `/test/test.proxy-chain-cleanup.node.ts` - Test for connection cleanup
- `/test/test.proxy-chaining-accumulation.node.ts` - Test for accumulation prevention
- `/.nogit/debug/connection-accumulation-debug.ts` - Debug script for connection states
- `/.nogit/debug/connection-accumulation-keepalive.ts` - Keep-alive specific tests
- `/.nogit/debug/connection-accumulation-http.ts` - HTTP traffic through proxy chains

### Summary

**Issue Identified**: Connection accumulation occurs on the **inner proxy** (not the outer) when backends are unreachable.

**Root Cause**: The `createSocketWithErrorHandler` function in socket-utils.ts does not implement a connection establishment timeout. It only sets `socket.setTimeout()`, which handles inactivity AFTER the connection is established, not during the connect phase.

**Impact**: When connecting to unreachable IPs (e.g., 10.255.255.1), outgoing sockets remain in the "opening" state indefinitely, causing connections to accumulate.

**Fix Required**:

1. Add a `connectionTimeout` setting to ISmartProxyOptions
2. Implement a proper connection timeout in `createSocketWithErrorHandler`
3. Pass the timeout value from route-connection-handler

**Workaround Until Fixed**: Configure shorter socket timeouts and use the periodic force cleanup suggested above.

The connection cleanup mechanisms were significantly improved in v19.5.20:

1. Race condition fixed by setting `record.outgoing` before connecting
2. The immediate routing cleanup handler always destroys outgoing connections
3. Tests confirm no accumulation in standard scenarios with reachable backends

However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect.

### Outer Proxy Sudden Accumulation After Hours

**User Report**: "The counter goes up suddenly after some hours on the outer proxy."

**Investigation Findings**:

1. **Cleanup Queue Mechanism**:
   - Connections are cleaned up in batches of 100 via a queue
   - If the cleanup timer gets stuck, or is cleared without being restarted, connections accumulate
   - The timer is set with `setTimeout` and could be affected by event loop blocking

2. **Potential Causes for Sudden Spikes**:

   a) **Cleanup Timer Failure**:
   ```typescript
   // In ConnectionManager, if this timer gets cleared but not restarted:
   this.cleanupTimer = this.setTimeout(() => {
     this.processCleanupQueue();
   }, 100);
   ```

   b) **Memory Pressure**:
   - After hours of operation, memory fragmentation or pressure could cause delays
   - Garbage collection pauses might interfere with timer execution

   c) **Event Listener Accumulation**:
   - Socket event listeners might accumulate over time
   - Server 'connection' event handlers are particularly important

   d) **Keep-Alive Connection Cascades**:
   - Many keep-alive connections timing out simultaneously
   - The outer proxy has a different timeout than the inner proxy
   - Mass disconnection events can overwhelm the cleanup queue

   e) **HttpProxy Component Issues**:
   - If `useHttpProxy` is enabled, the HttpProxy bridge might maintain connection pools
   - These pools might not be properly cleaned after hours

3. **Why "Sudden" After Hours**:
   - Not a gradual leak but triggered by specific conditions
   - Likely related to periodic events or thresholds:
     - The inactivity check runs every 30 seconds
     - Keep-alive connections have extended timeouts (6x normal)
     - The parity check has a 30-minute timeout for half-closed connections

4. **Reproduction Scenarios**:
   - Mass client disconnection/reconnection (a network blip)
   - Keep-alive timeout cascade when the inner proxy times out first
   - The cleanup timer getting stuck under high load
   - Memory pressure causing event loop delays

### Additional Monitoring Recommendations

1. **Add Cleanup Queue Monitoring**:
   ```typescript
   setInterval(() => {
     const cm = proxy.connectionManager;
     if (cm.cleanupQueue.size > 100 && !cm.cleanupTimer) {
       logger.error('Cleanup queue stuck!', {
         queueSize: cm.cleanupQueue.size,
         hasTimer: !!cm.cleanupTimer
       });
     }
   }, 60000);
   ```

2. **Track Timer Health** (see the sketch after this list):
   - Monitor whether the cleanup timer is running
   - Check for event loop blocking
   - Log when batch processing takes too long

3. **Memory Monitoring**:
   - Track heap usage over time
   - Monitor for memory leaks in long-running processes
   - Force periodic garbage collection if needed

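Items 2 and 3 can share one lightweight probe (a sketch): timer drift beyond the scheduled interval indicates event loop blocking, which also delays the cleanup batch timer, and `process.memoryUsage()` tracks the heap over time.

```typescript
// Fires every second; if it fires noticeably late, the event loop was
// blocked, and setTimeout-based cleanup batches were delayed too.
let lastTick = Date.now();

setInterval(() => {
  const now = Date.now();
  const lagMs = now - lastTick - 1000;
  lastTick = now;

  if (lagMs > 250) {
    console.warn(`Event loop lag ${lagMs}ms - cleanup timers may be delayed`);
  }

  const { heapUsed, rss } = process.memoryUsage();
  console.log(`heapUsed=${(heapUsed / 1e6).toFixed(1)}MB rss=${(rss / 1e6).toFixed(1)}MB`);
}, 1000);
```
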
### Immediate Mitigations

1. **Restart Cleanup Timer**:
   ```typescript
   // Emergency cleanup timer restart
   if (!cm.cleanupTimer && cm.cleanupQueue.size > 0) {
     cm.cleanupTimer = setTimeout(() => {
       cm.processCleanupQueue();
     }, 100);
   }
   ```

2. **Force Periodic Cleanup**:
   ```typescript
   setInterval(() => {
     const cm = connectionManager;
     if (cm.getConnectionCount() > threshold) {
       cm.performOptimizedInactivityCheck();
       // Force processing of the cleanup queue
       cm.processCleanupQueue();
     }
   }, 300000); // Every 5 minutes
   ```

3. **Connection Age Limits**:
   - Set a maximum connection lifetime
   - Force-close connections older than the threshold
   - Apply more aggressive cleanup for proxy chains

## ✅ FIXED: Zombie Connection Detection (January 2025)

### Root Cause Identified

"Zombie connections" occur when sockets are destroyed without triggering their close/error event handlers. This leaves connections tracked with both sockets destroyed but `connectionClosed = false`. It is particularly problematic in proxy chains, where the inner proxy might close connections in ways that don't trigger the proper events on the outer proxy.

### Fix Implemented

Added zombie detection to the periodic inactivity check in ConnectionManager:

```typescript
// In performOptimizedInactivityCheck()
// Check ALL connections for zombie state
for (const [connectionId, record] of this.connectionRecords) {
  if (!record.connectionClosed) {
    const incomingDestroyed = record.incoming?.destroyed || false;
    const outgoingDestroyed = record.outgoing?.destroyed || false;

    // Check for zombie connections: both sockets destroyed but not cleaned up
    if (incomingDestroyed && outgoingDestroyed) {
      logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, {
        connectionId,
        remoteIP: record.remoteIP,
        age: plugins.prettyMs(now - record.incomingStartTime),
        component: 'connection-manager'
      });

      // Clean up immediately
      this.cleanupConnection(record, 'zombie_cleanup');
      continue;
    }

    // Check for half-zombie: one socket destroyed
    if (incomingDestroyed || outgoingDestroyed) {
      const age = now - record.incomingStartTime;
      // Give it a 30-second grace period for normal cleanup
      if (age > 30000) {
        logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, {
          connectionId,
          remoteIP: record.remoteIP,
          age: plugins.prettyMs(age),
          incomingDestroyed,
          outgoingDestroyed,
          component: 'connection-manager'
        });

        // Clean up
        this.cleanupConnection(record, 'half_zombie_cleanup');
      }
    }
  }
}
```


### How It Works

1. **Full Zombie Detection**: Detects when both the incoming and outgoing sockets are destroyed but the connection hasn't been cleaned up
2. **Half-Zombie Detection**: Detects when only one socket is destroyed, with a 30-second grace period for normal cleanup to occur
3. **Automatic Cleanup**: Immediately cleans up zombie connections when detected
4. **Runs Periodically**: Integrated into the existing inactivity check that runs every 30 seconds

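The detection logic reduces to a small pure function over a record's socket flags. This sketch restates the checks above in isolation (the record shape is simplified):

```typescript
interface RecordFlags {
  connectionClosed: boolean;
  incomingDestroyed: boolean;
  outgoingDestroyed: boolean;
  ageMs: number;
}

const HALF_ZOMBIE_GRACE_MS = 30_000;

function classify(r: RecordFlags): 'zombie' | 'half-zombie' | 'ok' {
  if (r.connectionClosed) return 'ok'; // already cleaned up
  if (r.incomingDestroyed && r.outgoingDestroyed) return 'zombie';
  if ((r.incomingDestroyed || r.outgoingDestroyed) && r.ageMs > HALF_ZOMBIE_GRACE_MS) {
    return 'half-zombie';
  }
  return 'ok';
}
```
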
### Why This Fixes the Outer Proxy Accumulation

- When the inner proxy closes connections abruptly (e.g., due to backend failure), the outer proxy's outgoing socket might be destroyed without firing close/error events
- These become zombie connections that previously accumulated indefinitely
- Now they are detected and cleaned up within 30 seconds

### Test Results

Debug scripts confirmed:

- Zombie connections can be created when sockets are destroyed directly without events
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled

This fix addresses the user's specific request that "connections that are closed on the inner proxy, always also close on the outer proxy."