diff --git a/readme.connections.md b/readme.connections.md index 5673989..b7f419b 100644 --- a/readme.connections.md +++ b/readme.connections.md @@ -372,4 +372,180 @@ The connection cleanup mechanisms have been significantly improved in v19.5.20: 2. Immediate routing cleanup handler always destroys outgoing connections 3. Tests confirm no accumulation in standard scenarios with reachable backends -However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect. \ No newline at end of file +However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect. + +### Outer Proxy Sudden Accumulation After Hours + +**User Report**: "The counter goes up suddenly after some hours on the outer proxy" + +**Investigation Findings**: + +1. **Cleanup Queue Mechanism**: + - Connections are cleaned up in batches of 100 via a queue + - If the cleanup timer gets stuck or cleared without restart, connections accumulate + - The timer is set with `setTimeout` and could be affected by event loop blocking + +2. **Potential Causes for Sudden Spikes**: + + a) **Cleanup Timer Failure**: + ```typescript + // In ConnectionManager, if this timer gets cleared but not restarted: + this.cleanupTimer = this.setTimeout(() => { + this.processCleanupQueue(); + }, 100); + ``` + + b) **Memory Pressure**: + - After hours of operation, memory fragmentation or pressure could cause delays + - Garbage collection pauses might interfere with timer execution + + c) **Event Listener Accumulation**: + - Socket event listeners might accumulate over time + - Server 'connection' event handlers are particularly important + + d) **Keep-Alive Connection Cascades**: + - When many keep-alive connections timeout simultaneously + - Outer proxy has different timeout than inner proxy + - Mass disconnection events can overwhelm cleanup queue + + e) **HttpProxy Component Issues**: + - If using `useHttpProxy`, the HttpProxy bridge might maintain connection pools + - These pools might not be properly cleaned after hours + +3. **Why "Sudden" After Hours**: + - Not a gradual leak but triggered by specific conditions + - Likely related to periodic events or thresholds: + - Inactivity check runs every 30 seconds + - Keep-alive connections have extended timeouts (6x normal) + - Parity check has 30-minute timeout for half-closed connections + +4. **Reproduction Scenarios**: + - Mass client disconnection/reconnection (network blip) + - Keep-alive timeout cascade when inner proxy times out first + - Cleanup timer getting stuck during high load + - Memory pressure causing event loop delays + +### Additional Monitoring Recommendations + +1. **Add Cleanup Queue Monitoring**: + ```typescript + setInterval(() => { + const cm = proxy.connectionManager; + if (cm.cleanupQueue.size > 100 && !cm.cleanupTimer) { + logger.error('Cleanup queue stuck!', { + queueSize: cm.cleanupQueue.size, + hasTimer: !!cm.cleanupTimer + }); + } + }, 60000); + ``` + +2. **Track Timer Health**: + - Monitor if cleanup timer is running + - Check for event loop blocking + - Log when batch processing takes too long + +3. **Memory Monitoring**: + - Track heap usage over time + - Monitor for memory leaks in long-running processes + - Force periodic garbage collection if needed + +### Immediate Mitigations + +1. **Restart Cleanup Timer**: + ```typescript + // Emergency cleanup timer restart + if (!cm.cleanupTimer && cm.cleanupQueue.size > 0) { + cm.cleanupTimer = setTimeout(() => { + cm.processCleanupQueue(); + }, 100); + } + ``` + +2. **Force Periodic Cleanup**: + ```typescript + setInterval(() => { + const cm = connectionManager; + if (cm.getConnectionCount() > threshold) { + cm.performOptimizedInactivityCheck(); + // Force process cleanup queue + cm.processCleanupQueue(); + } + }, 300000); // Every 5 minutes + ``` + +3. **Connection Age Limits**: + - Set maximum connection lifetime + - Force close connections older than threshold + - More aggressive cleanup for proxy chains + +## ✅ FIXED: Zombie Connection Detection (January 2025) + +### Root Cause Identified +"Zombie connections" occur when sockets are destroyed without triggering their close/error event handlers. This causes connections to remain tracked with both sockets destroyed but `connectionClosed=false`. This is particularly problematic in proxy chains where the inner proxy might close connections in ways that don't trigger proper events on the outer proxy. + +### Fix Implemented +Added zombie detection to the periodic inactivity check in ConnectionManager: + +```typescript +// In performOptimizedInactivityCheck() +// Check ALL connections for zombie state +for (const [connectionId, record] of this.connectionRecords) { + if (!record.connectionClosed) { + const incomingDestroyed = record.incoming?.destroyed || false; + const outgoingDestroyed = record.outgoing?.destroyed || false; + + // Check for zombie connections: both sockets destroyed but not cleaned up + if (incomingDestroyed && outgoingDestroyed) { + logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, { + connectionId, + remoteIP: record.remoteIP, + age: plugins.prettyMs(now - record.incomingStartTime), + component: 'connection-manager' + }); + + // Clean up immediately + this.cleanupConnection(record, 'zombie_cleanup'); + continue; + } + + // Check for half-zombie: one socket destroyed + if (incomingDestroyed || outgoingDestroyed) { + const age = now - record.incomingStartTime; + // Give it 30 seconds grace period for normal cleanup + if (age > 30000) { + logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, { + connectionId, + remoteIP: record.remoteIP, + age: plugins.prettyMs(age), + incomingDestroyed, + outgoingDestroyed, + component: 'connection-manager' + }); + + // Clean up + this.cleanupConnection(record, 'half_zombie_cleanup'); + } + } + } +} +``` + +### How It Works +1. **Full Zombie Detection**: Detects when both incoming and outgoing sockets are destroyed but the connection hasn't been cleaned up +2. **Half-Zombie Detection**: Detects when only one socket is destroyed, with a 30-second grace period for normal cleanup to occur +3. **Automatic Cleanup**: Immediately cleans up zombie connections when detected +4. **Runs Periodically**: Integrated into the existing inactivity check that runs every 30 seconds + +### Why This Fixes the Outer Proxy Accumulation +- When inner proxy closes connections abruptly (e.g., due to backend failure), the outer proxy's outgoing socket might be destroyed without firing close/error events +- These become zombie connections that previously accumulated indefinitely +- Now they are detected and cleaned up within 30 seconds + +### Test Results +Debug scripts confirmed: +- Zombie connections can be created when sockets are destroyed directly without events +- The zombie detection successfully identifies and cleans up these connections +- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled + +This fix addresses the specific issue where "connections that are closed on the inner proxy, always also close on the outer proxy" as requested by the user. \ No newline at end of file diff --git a/readme.hints.md b/readme.hints.md index 4cf87ea..4fdf5ce 100644 --- a/readme.hints.md +++ b/readme.hints.md @@ -856,4 +856,42 @@ The WrappedSocket class has been implemented as the foundation for PROXY protoco For detailed information about proxy protocol implementation and proxy chaining: - **[Proxy Protocol Guide](./readme.proxy-protocol.md)** - Complete implementation details and configuration - **[Proxy Protocol Examples](./readme.proxy-protocol-example.md)** - Code examples and conceptual implementation -- **[Proxy Chain Summary](./readme.proxy-chain-summary.md)** - Quick reference for proxy chaining setup \ No newline at end of file +- **[Proxy Chain Summary](./readme.proxy-chain-summary.md)** - Quick reference for proxy chaining setup + +## Connection Cleanup Edge Cases Investigation (v19.5.20+) + +### Issue Discovered +"Zombie connections" can occur when both sockets are destroyed but the connection record hasn't been cleaned up. This happens when sockets are destroyed without triggering their close/error event handlers. + +### Root Cause +1. **Event Handler Bypass**: In edge cases (network failures, proxy chain failures, forced socket destruction), sockets can be destroyed without their event handlers being called +2. **Cleanup Queue Delay**: The `initiateCleanupOnce` method adds connections to a cleanup queue (batch of 100 every 100ms), which may not process fast enough +3. **Inactivity Check Limitation**: The periodic inactivity check only examines `lastActivity` timestamps, not actual socket states + +### Test Results +Debug script (`connection-manager-direct-test.ts`) revealed: +- **Normal cleanup works**: When socket events fire normally, cleanup is reliable +- **Zombies ARE created**: Direct socket destruction creates zombies (destroyed sockets, connectionClosed=false) +- **Manual cleanup works**: Calling `initiateCleanupOnce` on a zombie does clean it up +- **Inactivity check misses zombies**: The check doesn't detect connections with destroyed sockets + +### Potential Solutions +1. **Periodic Zombie Detection**: Add zombie detection to the inactivity check: + ```typescript + // In performOptimizedInactivityCheck + if (record.incoming?.destroyed && record.outgoing?.destroyed && !record.connectionClosed) { + this.cleanupConnection(record, 'zombie_detected'); + } + ``` + +2. **Socket State Monitoring**: Check socket states during connection operations +3. **Defensive Socket Handling**: Always attach cleanup handlers before any operation that might destroy sockets +4. **Immediate Cleanup Option**: For critical paths, use `cleanupConnection` instead of `initiateCleanupOnce` + +### Impact +- Memory leaks in edge cases (network failures, proxy chain issues) +- Connection count inaccuracy +- Potential resource exhaustion over time + +### Test Files +- `.nogit/debug/connection-manager-direct-test.ts` - Direct ConnectionManager testing showing zombie creation \ No newline at end of file diff --git a/test/test.zombie-connection-cleanup.node.ts b/test/test.zombie-connection-cleanup.node.ts new file mode 100644 index 0000000..53a5bfe --- /dev/null +++ b/test/test.zombie-connection-cleanup.node.ts @@ -0,0 +1,306 @@ +import { tap, expect } from '@git.zone/tstest/tapbundle'; +import * as net from 'net'; +import * as plugins from '../ts/plugins.js'; + +// Import SmartProxy +import { SmartProxy } from '../ts/index.js'; + +// Import types through type-only imports +import type { ConnectionManager } from '../ts/proxies/smart-proxy/connection-manager.js'; +import type { IConnectionRecord } from '../ts/proxies/smart-proxy/models/interfaces.js'; + +tap.test('zombie connection cleanup - verify inactivity check detects and cleans destroyed sockets', async () => { + console.log('\n=== Zombie Connection Cleanup Test ==='); + console.log('Purpose: Verify that connections with destroyed sockets are detected and cleaned up'); + console.log('Setup: Client → OuterProxy (8590) → InnerProxy (8591) → Backend (9998)'); + + // Create backend server that can be controlled + let acceptConnections = true; + let destroyImmediately = false; + const backendConnections: net.Socket[] = []; + + const backend = net.createServer((socket) => { + console.log('Backend: Connection received'); + backendConnections.push(socket); + + if (destroyImmediately) { + console.log('Backend: Destroying connection immediately'); + socket.destroy(); + } else { + socket.on('data', (data) => { + console.log('Backend: Received data, echoing back'); + socket.write(data); + }); + } + }); + + await new Promise((resolve) => { + backend.listen(9998, () => { + console.log('✓ Backend server started on port 9998'); + resolve(); + }); + }); + + // Create InnerProxy with faster inactivity check for testing + const innerProxy = new SmartProxy({ + ports: [8591], + enableDetailedLogging: true, + inactivityTimeout: 5000, // 5 seconds for faster testing + inactivityCheckInterval: 1000, // Check every second + routes: [{ + name: 'to-backend', + match: { ports: 8591 }, + action: { + type: 'forward', + target: { + host: 'localhost', + port: 9998 + } + } + }] + }); + + // Create OuterProxy with faster inactivity check + const outerProxy = new SmartProxy({ + ports: [8590], + enableDetailedLogging: true, + inactivityTimeout: 5000, // 5 seconds for faster testing + inactivityCheckInterval: 1000, // Check every second + routes: [{ + name: 'to-inner', + match: { ports: 8590 }, + action: { + type: 'forward', + target: { + host: 'localhost', + port: 8591 + } + } + }] + }); + + await innerProxy.start(); + console.log('✓ InnerProxy started on port 8591'); + + await outerProxy.start(); + console.log('✓ OuterProxy started on port 8590'); + + // Helper to get connection details + const getConnectionDetails = () => { + const outerConnMgr = (outerProxy as any).connectionManager as ConnectionManager; + const innerConnMgr = (innerProxy as any).connectionManager as ConnectionManager; + + const outerRecords = Array.from((outerConnMgr as any).connectionRecords.values()) as IConnectionRecord[]; + const innerRecords = Array.from((innerConnMgr as any).connectionRecords.values()) as IConnectionRecord[]; + + return { + outer: { + count: outerConnMgr.getConnectionCount(), + records: outerRecords, + zombies: outerRecords.filter(r => + !r.connectionClosed && + r.incoming?.destroyed && + (r.outgoing?.destroyed ?? true) + ), + halfZombies: outerRecords.filter(r => + !r.connectionClosed && + (r.incoming?.destroyed || r.outgoing?.destroyed) && + !(r.incoming?.destroyed && (r.outgoing?.destroyed ?? true)) + ) + }, + inner: { + count: innerConnMgr.getConnectionCount(), + records: innerRecords, + zombies: innerRecords.filter(r => + !r.connectionClosed && + r.incoming?.destroyed && + (r.outgoing?.destroyed ?? true) + ), + halfZombies: innerRecords.filter(r => + !r.connectionClosed && + (r.incoming?.destroyed || r.outgoing?.destroyed) && + !(r.incoming?.destroyed && (r.outgoing?.destroyed ?? true)) + ) + } + }; + }; + + console.log('\n--- Test 1: Create zombie by destroying sockets without events ---'); + + // Create a connection and forcefully destroy sockets to create zombies + const client1 = new net.Socket(); + await new Promise((resolve) => { + client1.connect(8590, 'localhost', () => { + console.log('Client1 connected to OuterProxy'); + client1.write('GET / HTTP/1.1\r\nHost: test.com\r\n\r\n'); + + // Wait for connection to be established through the chain + setTimeout(() => { + console.log('Forcefully destroying backend connections to create zombies'); + + // Get connection details before destruction + const beforeDetails = getConnectionDetails(); + console.log(`Before destruction: Outer=${beforeDetails.outer.count}, Inner=${beforeDetails.inner.count}`); + + // Destroy all backend connections without proper close events + backendConnections.forEach(conn => { + if (!conn.destroyed) { + // Remove all listeners to prevent proper cleanup + conn.removeAllListeners(); + conn.destroy(); + } + }); + + // Also destroy the client socket abruptly + client1.removeAllListeners(); + client1.destroy(); + + resolve(); + }, 500); + }); + }); + + // Check immediately after destruction + await new Promise(resolve => setTimeout(resolve, 100)); + let details = getConnectionDetails(); + console.log(`\nAfter destruction:`); + console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`); + console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`); + + // Wait for inactivity check to run (should detect zombies) + console.log('\nWaiting for inactivity check to detect zombies...'); + await new Promise(resolve => setTimeout(resolve, 2000)); + + details = getConnectionDetails(); + console.log(`\nAfter first inactivity check:`); + console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`); + console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`); + + console.log('\n--- Test 2: Create half-zombie by destroying only one socket ---'); + + // Clear backend connections array + backendConnections.length = 0; + + const client2 = new net.Socket(); + await new Promise((resolve) => { + client2.connect(8590, 'localhost', () => { + console.log('Client2 connected to OuterProxy'); + client2.write('GET / HTTP/1.1\r\nHost: test.com\r\n\r\n'); + + setTimeout(() => { + console.log('Creating half-zombie by destroying only outgoing socket on outer proxy'); + + // Access the connection records directly + const outerConnMgr = (outerProxy as any).connectionManager as ConnectionManager; + const outerRecords = Array.from((outerConnMgr as any).connectionRecords.values()) as IConnectionRecord[]; + + // Find the active connection and destroy only its outgoing socket + const activeRecord = outerRecords.find(r => !r.connectionClosed && r.outgoing && !r.outgoing.destroyed); + if (activeRecord && activeRecord.outgoing) { + console.log('Found active connection, destroying outgoing socket'); + activeRecord.outgoing.removeAllListeners(); + activeRecord.outgoing.destroy(); + } + + resolve(); + }, 500); + }); + }); + + // Check half-zombie state + await new Promise(resolve => setTimeout(resolve, 100)); + details = getConnectionDetails(); + console.log(`\nAfter creating half-zombie:`); + console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`); + console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`); + + // Wait for 30-second grace period (simulated by multiple checks) + console.log('\nWaiting for half-zombie grace period (30 seconds simulated)...'); + + // Manually age the connection to trigger half-zombie cleanup + const outerConnMgr = (outerProxy as any).connectionManager as ConnectionManager; + const records = Array.from((outerConnMgr as any).connectionRecords.values()) as IConnectionRecord[]; + records.forEach(record => { + if (!record.connectionClosed) { + // Age the connection by 35 seconds + record.incomingStartTime -= 35000; + } + }); + + // Trigger inactivity check + await new Promise(resolve => setTimeout(resolve, 2000)); + + details = getConnectionDetails(); + console.log(`\nAfter half-zombie cleanup:`); + console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`); + console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`); + + // Clean up client2 properly + if (!client2.destroyed) { + client2.destroy(); + } + + console.log('\n--- Test 3: Rapid zombie creation under load ---'); + + // Create multiple connections rapidly and destroy them + const rapidClients: net.Socket[] = []; + + for (let i = 0; i < 5; i++) { + const client = new net.Socket(); + rapidClients.push(client); + + client.connect(8590, 'localhost', () => { + console.log(`Rapid client ${i} connected`); + client.write('GET / HTTP/1.1\r\nHost: test.com\r\n\r\n'); + + // Destroy after random delay + setTimeout(() => { + client.removeAllListeners(); + client.destroy(); + }, Math.random() * 500); + }); + + // Small delay between connections + await new Promise(resolve => setTimeout(resolve, 50)); + } + + // Wait a bit + await new Promise(resolve => setTimeout(resolve, 1000)); + + details = getConnectionDetails(); + console.log(`\nAfter rapid connections:`); + console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`); + console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`); + + // Wait for cleanup + console.log('\nWaiting for final cleanup...'); + await new Promise(resolve => setTimeout(resolve, 3000)); + + details = getConnectionDetails(); + console.log(`\nFinal state:`); + console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`); + console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`); + + // Cleanup + await outerProxy.stop(); + await innerProxy.stop(); + backend.close(); + + // Verify all connections are cleaned up + console.log('\n--- Verification ---'); + + if (details.outer.count === 0 && details.inner.count === 0) { + console.log('✅ PASS: All zombie connections were cleaned up'); + } else { + console.log('❌ FAIL: Some connections remain'); + } + + expect(details.outer.count).toEqual(0); + expect(details.inner.count).toEqual(0); + expect(details.outer.zombies.length).toEqual(0); + expect(details.inner.zombies.length).toEqual(0); + expect(details.outer.halfZombies.length).toEqual(0); + expect(details.inner.halfZombies.length).toEqual(0); +}); + +tap.start(); \ No newline at end of file diff --git a/ts/proxies/smart-proxy/connection-manager.ts b/ts/proxies/smart-proxy/connection-manager.ts index 0cedf73..b88f1af 100644 --- a/ts/proxies/smart-proxy/connection-manager.ts +++ b/ts/proxies/smart-proxy/connection-manager.ts @@ -456,6 +456,48 @@ export class ConnectionManager extends LifecycleComponent { } } + // Also check ALL connections for zombie state (destroyed sockets but not cleaned up) + // This is critical for proxy chains where sockets can be destroyed without events + for (const [connectionId, record] of this.connectionRecords) { + if (!record.connectionClosed) { + const incomingDestroyed = record.incoming?.destroyed || false; + const outgoingDestroyed = record.outgoing?.destroyed || false; + + // Check for zombie connections: both sockets destroyed but connection not cleaned up + if (incomingDestroyed && outgoingDestroyed) { + logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, { + connectionId, + remoteIP: record.remoteIP, + age: plugins.prettyMs(now - record.incomingStartTime), + component: 'connection-manager' + }); + + // Clean up immediately + this.cleanupConnection(record, 'zombie_cleanup'); + continue; + } + + // Check for half-zombie: one socket destroyed + if (incomingDestroyed || outgoingDestroyed) { + const age = now - record.incomingStartTime; + // Give it 30 seconds grace period for normal cleanup + if (age > 30000) { + logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, { + connectionId, + remoteIP: record.remoteIP, + age: plugins.prettyMs(age), + incomingDestroyed, + outgoingDestroyed, + component: 'connection-manager' + }); + + // Clean up + this.cleanupConnection(record, 'half_zombie_cleanup'); + } + } + } + } + // Process only connections that need checking for (const connectionId of connectionsToCheck) { const record = this.connectionRecords.get(connectionId);