Compare commits

...

6 Commits

Author SHA1 Message Date
6726de277e 19.5.26
Some checks failed
Default (tags) / security (push) Successful in 35s
Default (tags) / test (push) Failing after 27m56s
Default (tags) / release (push) Has been skipped
Default (tags) / metadata (push) Has been skipped
2025-06-08 12:26:32 +00:00
dc3eda5e29 fix accumulation 2025-06-08 12:25:31 +00:00
82a350bf51 19.5.25
Some checks failed
Default (tags) / security (push) Successful in 37s
Default (tags) / test (push) Failing after 24m58s
Default (tags) / release (push) Has been skipped
Default (tags) / metadata (push) Has been skipped
2025-06-07 20:37:52 +00:00
890e907664 fix(connection): filter zombie connections part 2 2025-06-07 20:37:49 +00:00
19590ef107 19.5.24
Some checks failed
Default (tags) / security (push) Successful in 32s
Default (tags) / test (push) Failing after 24m57s
Default (tags) / release (push) Has been skipped
Default (tags) / metadata (push) Has been skipped
2025-06-07 10:56:08 +00:00
47735adbf2 Implement zombie connection detection and cleanup in ConnectionManager; enhance tests for edge cases 2025-06-07 10:55:59 +00:00
8 changed files with 1214 additions and 6 deletions

View File: package.json

@@ -1,6 +1,6 @@
{
"name": "@push.rocks/smartproxy",
"version": "19.5.23",
"version": "19.5.26",
"private": false,
"description": "A powerful proxy package with unified route-based configuration for high traffic management. Features include SSL/TLS support, flexible routing patterns, WebSocket handling, advanced security options, and automatic ACME certificate management.",
"main": "dist_ts/index.js",

View File

@@ -372,4 +372,353 @@ The connection cleanup mechanisms have been significantly improved in v19.5.20:
2. Immediate routing cleanup handler always destroys outgoing connections
3. Tests confirm no accumulation in standard scenarios with reachable backends
However, the missing connection establishment timeout causes accumulation when backends are unreachable or very slow to connect.
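As a sketch, the missing guard could look like this, assuming a plain `net` connection to the backend (the helper name and the 10s value are illustrative, not actual smartproxy internals):
```typescript
import * as net from 'net';

// Hypothetical helper: abort if the backend does not accept within timeoutMs.
function connectWithTimeout(host: string, port: number, timeoutMs = 10000): Promise<net.Socket> {
  return new Promise((resolve, reject) => {
    const socket = net.connect(port, host);
    const timer = setTimeout(() => {
      // destroy(err) emits 'error', so the reject path below fires
      socket.destroy(new Error('connect_timeout'));
    }, timeoutMs);
    socket.once('connect', () => {
      clearTimeout(timer);
      resolve(socket);
    });
    socket.once('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}
```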
### Outer Proxy Sudden Accumulation After Hours
**User Report**: "The counter goes up suddenly after some hours on the outer proxy"
**Investigation Findings**:
1. **Cleanup Queue Mechanism**:
- Connections are cleaned up in batches of 100 via a queue
- If the cleanup timer gets stuck or cleared without restart, connections accumulate
- The timer is set with `setTimeout` and could be affected by event loop blocking
2. **Potential Causes for Sudden Spikes**:
a) **Cleanup Timer Failure**:
```typescript
// In ConnectionManager, if this timer gets cleared but not restarted:
this.cleanupTimer = this.setTimeout(() => {
this.processCleanupQueue();
}, 100);
```
b) **Memory Pressure**:
- After hours of operation, memory fragmentation or pressure could cause delays
- Garbage collection pauses might interfere with timer execution
c) **Event Listener Accumulation**:
- Socket event listeners might accumulate over time
- Server 'connection' event handlers are particularly important
d) **Keep-Alive Connection Cascades**:
- When many keep-alive connections timeout simultaneously
- Outer proxy has different timeout than inner proxy
- Mass disconnection events can overwhelm cleanup queue
e) **HttpProxy Component Issues**:
- If using `useHttpProxy`, the HttpProxy bridge might maintain connection pools
- These pools might not be properly cleaned after hours
3. **Why "Sudden" After Hours**:
- Not a gradual leak but triggered by specific conditions
- Likely related to periodic events or thresholds:
- Inactivity check runs every 30 seconds
- Keep-alive connections have extended timeouts (6x normal)
- Parity check has 30-minute timeout for half-closed connections
4. **Reproduction Scenarios**:
- Mass client disconnection/reconnection (network blip); sketched below
- Keep-alive timeout cascade when inner proxy times out first
- Cleanup timer getting stuck during high load
- Memory pressure causing event loop delays
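The first scenario is straightforward to simulate; a sketch (the port and connection count are illustrative):
```typescript
import * as net from 'net';

// Open many keep-alive connections, then drop them all in the same tick.
const clients: net.Socket[] = [];
for (let i = 0; i < 500; i++) {
  const client = net.connect(8080, 'localhost'); // assumed proxy port
  client.on('error', () => {}); // ignore errors for this load sketch
  client.setKeepAlive(true);
  clients.push(client);
}

setTimeout(() => {
  // A "network blip": every socket dies at once, flooding the cleanup queue.
  for (const client of clients) client.destroy();
}, 5000);
```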
### Additional Monitoring Recommendations
1. **Add Cleanup Queue Monitoring**:
```typescript
setInterval(() => {
const cm = (proxy as any).connectionManager; // private property; accessed here for diagnostics only
if (cm.cleanupQueue.size > 100 && !cm.cleanupTimer) {
logger.error('Cleanup queue stuck!', {
queueSize: cm.cleanupQueue.size,
hasTimer: !!cm.cleanupTimer
});
}
}, 60000);
```
2. **Track Timer Health**:
- Monitor if cleanup timer is running
- Check for event loop blocking
- Log when batch processing takes too long
3. **Memory Monitoring**:
- Track heap usage over time
- Monitor for memory leaks in long-running processes
- Force periodic garbage collection if needed
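Both recommendations above can be covered with standard Node APIs; a sketch with illustrative thresholds:
```typescript
// Event loop lag: measure how late a 1-second interval actually fires.
let lastTick = Date.now();
setInterval(() => {
  const lag = Date.now() - lastTick - 1000;
  if (lag > 250) {
    console.warn(`Event loop lag ${lag}ms - cleanup timers may be delayed`);
  }
  lastTick = Date.now();
}, 1000);

// Heap usage: warn when the heap grows past an illustrative 1.5 GB.
setInterval(() => {
  const { heapUsed, heapTotal } = process.memoryUsage();
  if (heapUsed > 1.5 * 1024 ** 3) {
    console.warn(`High heap usage: ${Math.round(heapUsed / 1048576)} MB of ${Math.round(heapTotal / 1048576)} MB`);
  }
}, 60000);
```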
### Immediate Mitigations
1. **Restart Cleanup Timer**:
```typescript
// Emergency cleanup timer restart
if (!cm.cleanupTimer && cm.cleanupQueue.size > 0) {
cm.cleanupTimer = setTimeout(() => {
cm.processCleanupQueue();
}, 100);
}
```
2. **Force Periodic Cleanup**:
```typescript
setInterval(() => {
const cm = (proxy as any).connectionManager; // private property; illustrative access
if (cm.getConnectionCount() > threshold) {
cm.performOptimizedInactivityCheck();
// Force process cleanup queue
cm.processCleanupQueue();
}
}, 300000); // Every 5 minutes
```
3. **Connection Age Limits**:
- Set maximum connection lifetime
- Force close connections older than threshold
- More aggressive cleanup for proxy chains
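A sketch of such an age limit, reusing the record fields shown elsewhere in this document (the lifetime value and the 'max_lifetime_exceeded' reason string are assumptions):
```typescript
const MAX_CONNECTION_LIFETIME = 4 * 60 * 60 * 1000; // 4 hours, illustrative

setInterval(() => {
  const cm = (proxy as any).connectionManager; // private; diagnostic access
  const now = Date.now();
  for (const record of cm.connectionRecords.values()) {
    if (!record.connectionClosed && now - record.incomingStartTime > MAX_CONNECTION_LIFETIME) {
      cm.cleanupConnection(record, 'max_lifetime_exceeded'); // hypothetical reason string
    }
  }
}, 60000);
```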
## ✅ FIXED: Zombie Connection Detection (January 2025)
### Root Cause Identified
"Zombie connections" occur when sockets are destroyed without triggering their close/error event handlers. This causes connections to remain tracked with both sockets destroyed but `connectionClosed=false`. This is particularly problematic in proxy chains where the inner proxy might close connections in ways that don't trigger proper events on the outer proxy.
### Fix Implemented
Added zombie detection to the periodic inactivity check in ConnectionManager:
```typescript
// In performOptimizedInactivityCheck()
// Check ALL connections for zombie state
for (const [connectionId, record] of this.connectionRecords) {
if (!record.connectionClosed) {
const incomingDestroyed = record.incoming?.destroyed || false;
const outgoingDestroyed = record.outgoing?.destroyed || false;
// Check for zombie connections: both sockets destroyed but not cleaned up
if (incomingDestroyed && outgoingDestroyed) {
logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, {
connectionId,
remoteIP: record.remoteIP,
age: plugins.prettyMs(now - record.incomingStartTime),
component: 'connection-manager'
});
// Clean up immediately
this.cleanupConnection(record, 'zombie_cleanup');
continue;
}
// Check for half-zombie: one socket destroyed
if (incomingDestroyed || outgoingDestroyed) {
const age = now - record.incomingStartTime;
// Give it 30 seconds grace period for normal cleanup
if (age > 30000) {
logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, {
connectionId,
remoteIP: record.remoteIP,
age: plugins.prettyMs(age),
incomingDestroyed,
outgoingDestroyed,
component: 'connection-manager'
});
// Clean up
this.cleanupConnection(record, 'half_zombie_cleanup');
}
}
}
}
```
### How It Works
1. **Full Zombie Detection**: Detects when both incoming and outgoing sockets are destroyed but the connection hasn't been cleaned up
2. **Half-Zombie Detection**: Detects when only one socket is destroyed, with a 30-second grace period for normal cleanup to occur
3. **Automatic Cleanup**: Immediately cleans up zombie connections when detected
4. **Runs Periodically**: Integrated into the existing inactivity check that runs every 30 seconds
### Why This Fixes the Outer Proxy Accumulation
- When inner proxy closes connections abruptly (e.g., due to backend failure), the outer proxy's outgoing socket might be destroyed without firing close/error events
- These become zombie connections that previously accumulated indefinitely
- Now they are detected and cleaned up within 30 seconds
### Test Results
Debug scripts confirmed:
- Zombie connections can be created when sockets are destroyed directly without events
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled
This fix addresses the user's requirement that "connections that are closed on the inner proxy, always also close on the outer proxy".
## 🔍 Production Diagnostics (January 2025)
Since the zombie detection fix didn't fully resolve the issue, use the ProductionConnectionMonitor to diagnose the actual problem:
### How to Use the Production Monitor
1. **Add to your proxy startup script**:
```typescript
import ProductionConnectionMonitor from './production-connection-monitor.js';
// After proxy.start()
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds
// Monitor will automatically capture diagnostics when:
// - Connections exceed threshold (default: 50)
// - Sudden spike occurs (default: +20 connections)
```
2. **Diagnostics are saved to**: `.nogit/connection-diagnostics/`
3. **Force capture anytime**: `monitor.forceCaptureNow()`
### What the Monitor Captures
For each connection:
- Socket states (destroyed, readable, writable, readyState)
- Connection flags (closed, keepAlive, TLS status)
- Data transfer statistics
- Time since last activity
- Cleanup queue status
- Event listener counts
- Termination reasons
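A sketch of how such a per-connection snapshot can be assembled (the field names follow the connection records shown elsewhere in this document; the monitor's real output format may differ):
```typescript
import type { Socket } from 'net';

function snapshotSocket(socket: Socket | undefined) {
  if (!socket) return { exists: false };
  return {
    destroyed: socket.destroyed,
    readable: socket.readable,
    writable: socket.writable,
    readyState: socket.readyState,
    listeners: {
      data: socket.listenerCount('data'),
      error: socket.listenerCount('error'),
      close: socket.listenerCount('close'),
    },
  };
}

function snapshotConnection(id: string, record: any) {
  const now = Date.now();
  return {
    id,
    incomingState: snapshotSocket(record.incoming),
    outgoingState: snapshotSocket(record.outgoing),
    connectionClosed: record.connectionClosed,
    hasKeepAlive: record.hasKeepAlive,
    bytesReceived: record.bytesReceived,
    bytesSent: record.bytesSent,
    age: now - record.incomingStartTime,
    timeSinceLastActivity: now - record.lastActivity,
  };
}
```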
### Pattern Analysis
The monitor automatically identifies:
- **Zombie connections**: Both sockets destroyed but not cleaned up
- **Half-zombies**: One socket destroyed
- **Stuck connecting**: Outgoing socket stuck in connecting state
- **No outgoing**: Missing outgoing socket
- **Keep-alive stuck**: Keep-alive connections with no recent activity
- **Old connections**: Connections older than 1 hour
- **No data transfer**: Connections with no bytes transferred
- **Listener leaks**: Excessive event listeners
### Common Accumulation Patterns
1. **Connecting State Stuck**
- Outgoing socket shows `connecting: true` indefinitely
- Usually means connection timeout not working
- Check if backend is reachable
2. **Missing Outgoing Socket**
- Connection has no outgoing socket but isn't closed
- May indicate immediate routing issues
- Check error logs during connection setup
3. **Event Listener Accumulation**
- High listener counts (>20) on sockets
- Indicates cleanup not removing all listeners
- Can cause memory leaks (see the check after this list)
4. **Keep-Alive Zombies**
- Keep-alive connections not timing out
- Check keepAlive timeout settings
- May need more aggressive cleanup
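For pattern 3, listener counts can be read directly off any `net.Socket` (a sketch; the >20 threshold mirrors the text above):
```typescript
import type { Socket } from 'net';

function hasListenerLeak(socket: Socket): boolean {
  return ['data', 'error', 'close'].some(
    (event) => socket.listenerCount(event) > 20
  );
}
```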
### Next Steps
1. **Run the monitor in production** during accumulation
2. **Share the diagnostic files** from `.nogit/connection-diagnostics/`
3. **Look for patterns** in the captured snapshots
4. **Check specific connection IDs** that accumulate
The diagnostic files will show exactly what state connections are in when accumulation occurs, allowing targeted fixes for the specific issue.
## ✅ FIXED: Stuck Connection Detection (January 2025)
### Additional Root Cause Found
Connections to hanging backends (that accept but never respond) were not being cleaned up because:
- Both sockets remain alive (not destroyed)
- Keep-alive prevents normal timeout
- No data is sent back to the client despite receiving data
- These don't qualify as "zombies" since sockets aren't destroyed
### Fix Implemented
Added stuck connection detection to the periodic inactivity check:
```typescript
// Check for stuck connections: no data sent back to client
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
const age = now - record.incomingStartTime;
// If connection is older than 60 seconds and no data sent back, likely stuck
if (age > 60000) {
logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
connectionId,
remoteIP: record.remoteIP,
age: plugins.prettyMs(age),
bytesReceived: record.bytesReceived,
targetHost: record.targetHost,
targetPort: record.targetPort,
component: 'connection-manager'
});
// Clean up
this.cleanupConnection(record, 'stuck_no_response');
}
}
```
### What This Fixes
- Connections to backends that accept but never respond
- Proxy chains where inner proxy connects to unresponsive services
- Scenarios where keep-alive prevents normal timeout mechanisms
- Connections that receive client data but never send anything back
### Detection Criteria
- Connection has received bytes from client (`bytesReceived > 0`)
- No bytes sent back to client (`bytesSent === 0`)
- Connection is older than 60 seconds
- Both sockets are still alive (not destroyed)
This complements the zombie detection by handling cases where sockets remain technically alive but the connection is effectively dead.
## 🚨 CRITICAL FIX: Cleanup Queue Bug (January 2025)
### Critical Bug Found
The cleanup queue had a severe bug that caused connection accumulation when more than 100 connections needed cleanup:
```typescript
// BUG: This cleared the ENTIRE queue after processing only the first batch!
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
this.cleanupQueue.clear(); // ❌ This discarded all connections beyond the first 100!
```
### Fix Implemented
```typescript
// Now only removes the connections being processed
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
for (const connectionId of toCleanup) {
this.cleanupQueue.delete(connectionId); // ✅ Only remove what we process
const record = this.connectionRecords.get(connectionId);
if (record) {
this.cleanupConnection(record, record.incomingTerminationReason || 'normal');
}
}
```
### Impact
- **Before**: If 150 connections needed cleanup, only the first 100 were processed; the remaining 50 were dropped from the queue and never cleaned up
- **After**: All connections are properly cleaned up in batches
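The difference is easy to demonstrate in isolation (a standalone sketch, not the ConnectionManager itself):
```typescript
const queue = new Set<string>();
for (let i = 0; i < 150; i++) queue.add(`conn-${i}`);

const batchSize = 100;
const batch = Array.from(queue).slice(0, batchSize);

// Buggy behavior: queue.clear() here would also drop the 50 unprocessed ids.
// Fixed behavior: remove only what this batch actually processes.
for (const id of batch) queue.delete(id);

console.log(queue.size); // 50 - still queued for the next batch
```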
### Additional Improvements
1. **Faster Inactivity Checks**: Reduced from 30s to 10s intervals
- Zombies and stuck connections are detected 3x faster
- Reduces the window for accumulation
2. **Duplicate Prevention**: Added check in queueCleanup to prevent processing already-closed connections
- Prevents unnecessary work
- Ensures connections are only cleaned up once
### Summary of All Fixes
1. **Connection Timeout** (already documented) - Prevents accumulation when backends are unreachable
2. **Zombie Detection** - Cleans up connections with destroyed sockets
3. **Stuck Connection Detection** - Cleans up connections to hanging backends
4. **Cleanup Queue Bug** - Ensures ALL connections get cleaned up, not just the first 100
5. **Faster Detection** - Reduced check interval from 30s to 10s
These fixes combined should prevent connection accumulation in all known scenarios.

View File

@@ -856,4 +856,42 @@ The WrappedSocket class has been implemented as the foundation for PROXY protoco
For detailed information about proxy protocol implementation and proxy chaining:
- **[Proxy Protocol Guide](./readme.proxy-protocol.md)** - Complete implementation details and configuration
- **[Proxy Protocol Examples](./readme.proxy-protocol-example.md)** - Code examples and conceptual implementation
- **[Proxy Chain Summary](./readme.proxy-chain-summary.md)** - Quick reference for proxy chaining setup
## Connection Cleanup Edge Cases Investigation (v19.5.20+)
### Issue Discovered
"Zombie connections" can occur when both sockets are destroyed but the connection record hasn't been cleaned up. This happens when sockets are destroyed without triggering their close/error event handlers.
### Root Cause
1. **Event Handler Bypass**: In edge cases (network failures, proxy chain failures, forced socket destruction), sockets can be destroyed without their event handlers being called
2. **Cleanup Queue Delay**: The `initiateCleanupOnce` method adds connections to a cleanup queue (batch of 100 every 100ms), which may not process fast enough
3. **Inactivity Check Limitation**: The periodic inactivity check only examines `lastActivity` timestamps, not actual socket states
### Test Results
Debug script (`connection-manager-direct-test.ts`) revealed:
- **Normal cleanup works**: When socket events fire normally, cleanup is reliable
- **Zombies ARE created**: Direct socket destruction creates zombies (destroyed sockets, connectionClosed=false)
- **Manual cleanup works**: Calling `initiateCleanupOnce` on a zombie does clean it up
- **Inactivity check misses zombies**: The check doesn't detect connections with destroyed sockets
### Potential Solutions
1. **Periodic Zombie Detection**: Add zombie detection to the inactivity check:
```typescript
// In performOptimizedInactivityCheck
if (record.incoming?.destroyed && record.outgoing?.destroyed && !record.connectionClosed) {
this.cleanupConnection(record, 'zombie_detected');
}
```
2. **Socket State Monitoring**: Check socket states during connection operations
3. **Defensive Socket Handling**: Always attach cleanup handlers before any operation that might destroy sockets (see the helper sketch after this list)
4. **Immediate Cleanup Option**: For critical paths, use `cleanupConnection` instead of `initiateCleanupOnce`
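For solution 3, a small helper sketch (illustrative, not an existing smartproxy API) that guarantees a cleanup callback fires at most once, whichever of 'close' or 'error' arrives first:
```typescript
import type { Socket } from 'net';

function attachDefensiveCleanup(socket: Socket, cleanup: () => void): void {
  let done = false;
  const fire = () => {
    if (!done) {
      done = true;
      cleanup();
    }
  };
  // Attach before any write/pipe/destroy so destruction cannot bypass us.
  socket.once('close', fire);
  socket.once('error', fire);
}
```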
### Impact
- Memory leaks in edge cases (network failures, proxy chain issues)
- Connection count inaccuracy
- Potential resource exhaustion over time
### Test Files
- `.nogit/debug/connection-manager-direct-test.ts` - Direct ConnectionManager testing showing zombie creation

readme.monitoring.md (new file)
View File

@@ -0,0 +1,202 @@
# Production Connection Monitoring
This document explains how to use the ProductionConnectionMonitor to diagnose connection accumulation issues in real-time.
## Quick Start
```typescript
import ProductionConnectionMonitor from './.nogit/debug/production-connection-monitor.js';
// After starting your proxy
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds
// The monitor will automatically capture diagnostics when:
// - Connections exceed 50 (default threshold)
// - Sudden spike of 20+ connections occurs
// - You manually call monitor.forceCaptureNow()
```
## What Gets Captured
When accumulation is detected, the monitor saves a JSON file with:
### Connection Details
- Socket states (destroyed, readable, writable, readyState)
- Connection age and activity timestamps
- Data transfer statistics (bytes sent/received)
- Target host and port information
- Keep-alive status
- Event listener counts
### System State
- Memory usage
- Event loop lag
- Connection count trends
- Termination statistics
## Reading Diagnostic Files
Files are saved to `.nogit/connection-diagnostics/` with names like:
```
accumulation_2025-06-07T20-20-43-733Z_force_capture.json
```
### Key Fields to Check
1. **Socket States**
```json
"incomingState": {
"destroyed": false,
"readable": true,
"writable": true,
"readyState": "open"
}
```
- Both destroyed = zombie connection
- One destroyed = half-zombie
- Both alive but old = potential stuck connection
2. **Data Transfer**
```json
"bytesReceived": 36,
"bytesSent": 0,
"timeSinceLastActivity": 60000
```
- No bytes sent back = stuck connection
- High bytes but old = slow backend
- No activity = idle connection
3. **Connection Flags**
```json
"hasReceivedInitialData": false,
"hasKeepAlive": true,
"connectionClosed": false
```
- hasReceivedInitialData=false on non-TLS = immediate routing
- hasKeepAlive=true = extended timeout applies
- connectionClosed=false = still tracked
## Common Patterns
### 1. Hanging Backend Pattern
```json
{
"bytesReceived": 36,
"bytesSent": 0,
"age": 120000,
"targetHost": "backend.example.com",
"incomingState": { "destroyed": false },
"outgoingState": { "destroyed": false }
}
```
**Fix**: The stuck connection detection (60s timeout) should clean these up.
### 2. Zombie Connection Pattern
```json
{
"incomingState": { "destroyed": true },
"outgoingState": { "destroyed": true },
"connectionClosed": false
}
```
**Fix**: The zombie detection should clean these up within 30s.
### 3. Event Listener Leak Pattern
```json
{
"incomingListeners": {
"data": 15,
"error": 20,
"close": 18
}
}
```
**Issue**: Event listeners accumulating, potential memory leak.
### 4. No Outgoing Socket Pattern
```json
{
"outgoingState": { "exists": false },
"connectionClosed": false,
"age": 5000
}
```
**Issue**: Connection setup failed but cleanup didn't trigger.
## Forcing Diagnostic Capture
To capture current state immediately:
```typescript
monitor.forceCaptureNow();
```
This is useful when you notice accumulation starting.
## Automated Analysis
The monitor automatically analyzes patterns and logs:
- Zombie/half-zombie counts
- Stuck connection counts
- Old connection counts
- Memory usage
- Recommendations
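A sketch of the kind of classification involved, reusing the snapshot fields and thresholds from this document (60 seconds for stuck, 1 hour for old):
```typescript
type Pattern = 'zombie' | 'half-zombie' | 'stuck' | 'old' | 'ok';

function classify(snap: {
  incomingState: { destroyed?: boolean };
  outgoingState: { destroyed?: boolean };
  connectionClosed: boolean;
  bytesReceived: number;
  bytesSent: number;
  age: number;
}): Pattern {
  if (snap.connectionClosed) return 'ok';
  const inDead = !!snap.incomingState.destroyed;
  const outDead = !!snap.outgoingState.destroyed;
  if (inDead && outDead) return 'zombie';
  if (inDead || outDead) return 'half-zombie';
  if (snap.bytesReceived > 0 && snap.bytesSent === 0 && snap.age > 60_000) return 'stuck';
  if (snap.age > 3_600_000) return 'old';
  return 'ok';
}
```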
## Integration Example
```typescript
// In your proxy startup script
import { SmartProxy } from '@push.rocks/smartproxy';
import ProductionConnectionMonitor from './production-connection-monitor.js';
async function startProxyWithMonitoring() {
const proxy = new SmartProxy({
// your config
});
await proxy.start();
// Start monitoring
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000);
// Optional: Capture on specific events
process.on('SIGUSR1', () => {
console.log('Manual diagnostic capture triggered');
monitor.forceCaptureNow();
});
// Graceful shutdown
process.on('SIGTERM', async () => {
monitor.stop();
await proxy.stop();
process.exit(0);
});
}
```
## Troubleshooting
### Monitor Not Detecting Accumulation
- Check threshold settings (default: 50 connections)
- Reduce check interval for faster detection
- Use forceCaptureNow() to capture current state
### Too Many False Positives
- Increase accumulation threshold
- Increase spike threshold
- Adjust check interval
### Missing Diagnostic Data
- Ensure output directory exists and is writable
- Check disk space
- Verify process has write permissions
## Next Steps
1. Deploy the monitor to production
2. Wait for accumulation to occur
3. Share diagnostic files for analysis
4. Apply targeted fixes based on patterns found
The diagnostic data will reveal the exact state of connections when accumulation occurs, enabling precise fixes for your specific scenario.

View File

@@ -0,0 +1,93 @@
import { expect, tap } from '@git.zone/tstest/tapbundle';
import { SmartProxy } from '../ts/index.js';
tap.test('cleanup queue bug - verify queue processing handles more than batch size', async (tools) => {
console.log('\n=== Cleanup Queue Bug Test ===');
console.log('Purpose: Verify that the cleanup queue correctly processes all connections');
console.log('even when there are more than the batch size (100)');
// Create proxy
const proxy = new SmartProxy({
routes: [{
name: 'test-route',
match: { ports: 8588 },
action: {
type: 'forward',
target: { host: 'localhost', port: 9996 }
}
}],
enableDetailedLogging: false,
});
await proxy.start();
console.log('✓ Proxy started on port 8588');
// Access connection manager
const cm = (proxy as any).connectionManager;
// Create mock connection records
console.log('\n--- Creating 150 mock connections ---');
const mockConnections: any[] = [];
for (let i = 0; i < 150; i++) {
const mockRecord = {
id: `mock-${i}`,
incoming: { destroyed: true, remoteAddress: '127.0.0.1' },
outgoing: { destroyed: true },
connectionClosed: false,
incomingStartTime: Date.now(),
lastActivity: Date.now(),
remoteIP: '127.0.0.1',
remotePort: 10000 + i,
localPort: 8588,
bytesReceived: 100,
bytesSent: 100,
incomingTerminationReason: null,
cleanupTimer: null
};
// Add to connection records
cm.connectionRecords.set(mockRecord.id, mockRecord);
mockConnections.push(mockRecord);
}
console.log(`Created ${cm.getConnectionCount()} mock connections`);
expect(cm.getConnectionCount()).toEqual(150);
// Queue all connections for cleanup
console.log('\n--- Queueing all connections for cleanup ---');
for (const conn of mockConnections) {
cm.initiateCleanupOnce(conn, 'test_cleanup');
}
console.log(`Cleanup queue size: ${cm.cleanupQueue.size}`);
expect(cm.cleanupQueue.size).toEqual(150);
// Wait for cleanup to complete
console.log('\n--- Waiting for cleanup batches to process ---');
// The first batch should process immediately (100 connections)
// Then additional batches should be scheduled
await new Promise(resolve => setTimeout(resolve, 500));
// Check final state
const finalCount = cm.getConnectionCount();
console.log(`\nFinal connection count: ${finalCount}`);
console.log(`Cleanup queue size: ${cm.cleanupQueue.size}`);
// All connections should be cleaned up
expect(finalCount).toEqual(0);
expect(cm.cleanupQueue.size).toEqual(0);
// Verify termination stats
const stats = cm.getTerminationStats();
console.log('Termination stats:', stats);
expect(stats.incoming.test_cleanup).toEqual(150);
// Cleanup
await proxy.stop();
console.log('\n✓ Test complete: Cleanup queue now correctly processes all connections');
});
tap.start();

View File

@@ -0,0 +1,144 @@
import { expect, tap } from '@git.zone/tstest/tapbundle';
import * as net from 'net';
import { SmartProxy } from '../ts/index.js';
import * as plugins from '../ts/plugins.js';
tap.test('stuck connection cleanup - verify connections to hanging backends are cleaned up', async (tools) => {
console.log('\n=== Stuck Connection Cleanup Test ===');
console.log('Purpose: Verify that connections to backends that accept but never respond are cleaned up');
// Create a hanging backend that accepts connections but never responds
let backendConnections = 0;
const hangingBackend = net.createServer((socket) => {
backendConnections++;
console.log(`Hanging backend: Connection ${backendConnections} received`);
// Accept the connection but never send any data back
// This simulates a hung backend service
});
await new Promise<void>((resolve) => {
hangingBackend.listen(9997, () => {
console.log('✓ Hanging backend started on port 9997');
resolve();
});
});
// Create proxy that forwards to hanging backend
const proxy = new SmartProxy({
routes: [{
name: 'to-hanging-backend',
match: { ports: 8589 },
action: {
type: 'forward',
target: { host: 'localhost', port: 9997 }
}
}],
keepAlive: true,
enableDetailedLogging: false,
inactivityTimeout: 5000, // 5 second inactivity check interval for faster testing
});
await proxy.start();
console.log('✓ Proxy started on port 8589');
// Create connections that will get stuck
console.log('\n--- Creating connections to hanging backend ---');
const clients: net.Socket[] = [];
for (let i = 0; i < 5; i++) {
const client = net.connect(8589, 'localhost');
clients.push(client);
await new Promise<void>((resolve) => {
client.on('connect', () => {
console.log(`Client ${i} connected`);
// Send data that will never get a response
client.write(`GET / HTTP/1.1\r\nHost: localhost\r\n\r\n`);
resolve();
});
client.on('error', (err) => {
console.log(`Client ${i} error: ${err.message}`);
resolve();
});
});
}
// Wait a moment for connections to establish
await plugins.smartdelay.delayFor(1000);
// Check initial connection count
const initialCount = (proxy as any).connectionManager.getConnectionCount();
console.log(`\nInitial connection count: ${initialCount}`);
expect(initialCount).toEqual(5);
// Get connection details
const connections = (proxy as any).connectionManager.getConnections();
let stuckCount = 0;
for (const [id, record] of connections) {
if (record.bytesReceived > 0 && record.bytesSent === 0) {
stuckCount++;
console.log(`Stuck connection ${id}: received=${record.bytesReceived}, sent=${record.bytesSent}`);
}
}
console.log(`Stuck connections found: ${stuckCount}`);
expect(stuckCount).toEqual(5);
// Instead of waiting out the real 60-second threshold, age the connections
// manually and then trigger the inactivity check directly
console.log('\n--- Simulating connection age to trigger stuck detection ---');
console.log('Note: Stuck connections are cleaned up after 60 seconds with no response');
// First, age the connections by updating their timestamps
const now = Date.now();
for (const [id, record] of connections) {
// Simulate that these connections are 61 seconds old
record.incomingStartTime = now - 61000;
record.lastActivity = now - 61000;
}
// Manually trigger inactivity check
console.log('Manually triggering inactivity check...');
(proxy as any).connectionManager.performOptimizedInactivityCheck();
// Wait for cleanup to complete
await plugins.smartdelay.delayFor(1000);
// Check connection count after cleanup
const afterCleanupCount = (proxy as any).connectionManager.getConnectionCount();
console.log(`\nConnection count after cleanup: ${afterCleanupCount}`);
// Verify termination stats
const stats = (proxy as any).connectionManager.getTerminationStats();
console.log('\nTermination stats:', stats);
// All connections should be cleaned up as "stuck_no_response"
expect(afterCleanupCount).toEqual(0);
// The termination reason might be under incoming or general stats
const stuckCleanups = (stats.incoming.stuck_no_response || 0) +
(stats.outgoing?.stuck_no_response || 0);
console.log(`Stuck cleanups detected: ${stuckCleanups}`);
expect(stuckCleanups).toBeGreaterThan(0);
// Verify clients were disconnected
let closedClients = 0;
for (const client of clients) {
if (client.destroyed) {
closedClients++;
}
}
console.log(`Closed clients: ${closedClients}/5`);
expect(closedClients).toEqual(5);
// Cleanup
console.log('\n--- Cleanup ---');
await proxy.stop();
hangingBackend.close();
console.log('✓ Test complete: Stuck connections are properly detected and cleaned up');
});
tap.start();

View File

@@ -0,0 +1,306 @@
import { tap, expect } from '@git.zone/tstest/tapbundle';
import * as net from 'net';
import * as plugins from '../ts/plugins.js';
// Import SmartProxy
import { SmartProxy } from '../ts/index.js';
// Import types through type-only imports
import type { ConnectionManager } from '../ts/proxies/smart-proxy/connection-manager.js';
import type { IConnectionRecord } from '../ts/proxies/smart-proxy/models/interfaces.js';
tap.test('zombie connection cleanup - verify inactivity check detects and cleans destroyed sockets', async () => {
console.log('\n=== Zombie Connection Cleanup Test ===');
console.log('Purpose: Verify that connections with destroyed sockets are detected and cleaned up');
console.log('Setup: Client → OuterProxy (8590) → InnerProxy (8591) → Backend (9998)');
// Create backend server that can be controlled
let acceptConnections = true;
let destroyImmediately = false;
const backendConnections: net.Socket[] = [];
const backend = net.createServer((socket) => {
console.log('Backend: Connection received');
backendConnections.push(socket);
if (destroyImmediately) {
console.log('Backend: Destroying connection immediately');
socket.destroy();
} else {
socket.on('data', (data) => {
console.log('Backend: Received data, echoing back');
socket.write(data);
});
}
});
await new Promise<void>((resolve) => {
backend.listen(9998, () => {
console.log('✓ Backend server started on port 9998');
resolve();
});
});
// Create InnerProxy with faster inactivity check for testing
const innerProxy = new SmartProxy({
ports: [8591],
enableDetailedLogging: true,
inactivityTimeout: 5000, // 5 seconds for faster testing
inactivityCheckInterval: 1000, // Check every second
routes: [{
name: 'to-backend',
match: { ports: 8591 },
action: {
type: 'forward',
target: {
host: 'localhost',
port: 9998
}
}
}]
});
// Create OuterProxy with faster inactivity check
const outerProxy = new SmartProxy({
ports: [8590],
enableDetailedLogging: true,
inactivityTimeout: 5000, // 5 seconds for faster testing
inactivityCheckInterval: 1000, // Check every second
routes: [{
name: 'to-inner',
match: { ports: 8590 },
action: {
type: 'forward',
target: {
host: 'localhost',
port: 8591
}
}
}]
});
await innerProxy.start();
console.log('✓ InnerProxy started on port 8591');
await outerProxy.start();
console.log('✓ OuterProxy started on port 8590');
// Helper to get connection details
const getConnectionDetails = () => {
const outerConnMgr = (outerProxy as any).connectionManager as ConnectionManager;
const innerConnMgr = (innerProxy as any).connectionManager as ConnectionManager;
const outerRecords = Array.from((outerConnMgr as any).connectionRecords.values()) as IConnectionRecord[];
const innerRecords = Array.from((innerConnMgr as any).connectionRecords.values()) as IConnectionRecord[];
return {
outer: {
count: outerConnMgr.getConnectionCount(),
records: outerRecords,
zombies: outerRecords.filter(r =>
!r.connectionClosed &&
r.incoming?.destroyed &&
(r.outgoing?.destroyed ?? true)
),
halfZombies: outerRecords.filter(r =>
!r.connectionClosed &&
(r.incoming?.destroyed || r.outgoing?.destroyed) &&
!(r.incoming?.destroyed && (r.outgoing?.destroyed ?? true))
)
},
inner: {
count: innerConnMgr.getConnectionCount(),
records: innerRecords,
zombies: innerRecords.filter(r =>
!r.connectionClosed &&
r.incoming?.destroyed &&
(r.outgoing?.destroyed ?? true)
),
halfZombies: innerRecords.filter(r =>
!r.connectionClosed &&
(r.incoming?.destroyed || r.outgoing?.destroyed) &&
!(r.incoming?.destroyed && (r.outgoing?.destroyed ?? true))
)
}
};
};
console.log('\n--- Test 1: Create zombie by destroying sockets without events ---');
// Create a connection and forcefully destroy sockets to create zombies
const client1 = new net.Socket();
await new Promise<void>((resolve) => {
client1.connect(8590, 'localhost', () => {
console.log('Client1 connected to OuterProxy');
client1.write('GET / HTTP/1.1\r\nHost: test.com\r\n\r\n');
// Wait for connection to be established through the chain
setTimeout(() => {
console.log('Forcefully destroying backend connections to create zombies');
// Get connection details before destruction
const beforeDetails = getConnectionDetails();
console.log(`Before destruction: Outer=${beforeDetails.outer.count}, Inner=${beforeDetails.inner.count}`);
// Destroy all backend connections without proper close events
backendConnections.forEach(conn => {
if (!conn.destroyed) {
// Remove all listeners to prevent proper cleanup
conn.removeAllListeners();
conn.destroy();
}
});
// Also destroy the client socket abruptly
client1.removeAllListeners();
client1.destroy();
resolve();
}, 500);
});
});
// Check immediately after destruction
await new Promise(resolve => setTimeout(resolve, 100));
let details = getConnectionDetails();
console.log(`\nAfter destruction:`);
console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`);
console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`);
// Wait for inactivity check to run (should detect zombies)
console.log('\nWaiting for inactivity check to detect zombies...');
await new Promise(resolve => setTimeout(resolve, 2000));
details = getConnectionDetails();
console.log(`\nAfter first inactivity check:`);
console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`);
console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`);
console.log('\n--- Test 2: Create half-zombie by destroying only one socket ---');
// Clear backend connections array
backendConnections.length = 0;
const client2 = new net.Socket();
await new Promise<void>((resolve) => {
client2.connect(8590, 'localhost', () => {
console.log('Client2 connected to OuterProxy');
client2.write('GET / HTTP/1.1\r\nHost: test.com\r\n\r\n');
setTimeout(() => {
console.log('Creating half-zombie by destroying only outgoing socket on outer proxy');
// Access the connection records directly
const outerConnMgr = (outerProxy as any).connectionManager as ConnectionManager;
const outerRecords = Array.from((outerConnMgr as any).connectionRecords.values()) as IConnectionRecord[];
// Find the active connection and destroy only its outgoing socket
const activeRecord = outerRecords.find(r => !r.connectionClosed && r.outgoing && !r.outgoing.destroyed);
if (activeRecord && activeRecord.outgoing) {
console.log('Found active connection, destroying outgoing socket');
activeRecord.outgoing.removeAllListeners();
activeRecord.outgoing.destroy();
}
resolve();
}, 500);
});
});
// Check half-zombie state
await new Promise(resolve => setTimeout(resolve, 100));
details = getConnectionDetails();
console.log(`\nAfter creating half-zombie:`);
console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`);
console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`);
// Wait for 30-second grace period (simulated by multiple checks)
console.log('\nWaiting for half-zombie grace period (30 seconds simulated)...');
// Manually age the connection to trigger half-zombie cleanup
const outerConnMgr = (outerProxy as any).connectionManager as ConnectionManager;
const records = Array.from((outerConnMgr as any).connectionRecords.values()) as IConnectionRecord[];
records.forEach(record => {
if (!record.connectionClosed) {
// Age the connection by 35 seconds
record.incomingStartTime -= 35000;
}
});
// Trigger inactivity check
await new Promise(resolve => setTimeout(resolve, 2000));
details = getConnectionDetails();
console.log(`\nAfter half-zombie cleanup:`);
console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`);
console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`);
// Clean up client2 properly
if (!client2.destroyed) {
client2.destroy();
}
console.log('\n--- Test 3: Rapid zombie creation under load ---');
// Create multiple connections rapidly and destroy them
const rapidClients: net.Socket[] = [];
for (let i = 0; i < 5; i++) {
const client = new net.Socket();
rapidClients.push(client);
client.connect(8590, 'localhost', () => {
console.log(`Rapid client ${i} connected`);
client.write('GET / HTTP/1.1\r\nHost: test.com\r\n\r\n');
// Destroy after random delay
setTimeout(() => {
client.removeAllListeners();
client.destroy();
}, Math.random() * 500);
});
// Small delay between connections
await new Promise(resolve => setTimeout(resolve, 50));
}
// Wait a bit
await new Promise(resolve => setTimeout(resolve, 1000));
details = getConnectionDetails();
console.log(`\nAfter rapid connections:`);
console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`);
console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`);
// Wait for cleanup
console.log('\nWaiting for final cleanup...');
await new Promise(resolve => setTimeout(resolve, 3000));
details = getConnectionDetails();
console.log(`\nFinal state:`);
console.log(` Outer: ${details.outer.count} connections, ${details.outer.zombies.length} zombies, ${details.outer.halfZombies.length} half-zombies`);
console.log(` Inner: ${details.inner.count} connections, ${details.inner.zombies.length} zombies, ${details.inner.halfZombies.length} half-zombies`);
// Cleanup
await outerProxy.stop();
await innerProxy.stop();
backend.close();
// Verify all connections are cleaned up
console.log('\n--- Verification ---');
if (details.outer.count === 0 && details.inner.count === 0) {
console.log('✅ PASS: All zombie connections were cleaned up');
} else {
console.log('❌ FAIL: Some connections remain');
}
expect(details.outer.count).toEqual(0);
expect(details.inner.count).toEqual(0);
expect(details.outer.zombies.length).toEqual(0);
expect(details.inner.zombies.length).toEqual(0);
expect(details.outer.halfZombies.length).toEqual(0);
expect(details.inner.halfZombies.length).toEqual(0);
});
tap.start();

View File: ts/proxies/smart-proxy/connection-manager.ts

@@ -140,10 +140,10 @@ export class ConnectionManager extends LifecycleComponent {
* Start the inactivity check timer
*/
private startInactivityCheckTimer(): void {
- // Check every 30 seconds for connections that need inactivity check
+ // Check more frequently (every 10 seconds) to catch zombies and stuck connections faster
this.setInterval(() => {
this.performOptimizedInactivityCheck();
- }, 30000);
+ }, 10000);
// Note: LifecycleComponent's setInterval already calls unref()
}
@@ -194,6 +194,13 @@
* Queue a connection for cleanup
*/
private queueCleanup(connectionId: string): void {
// Check if connection is already being processed
const record = this.connectionRecords.get(connectionId);
if (!record || record.connectionClosed) {
// Already cleaned up or doesn't exist, skip
return;
}
this.cleanupQueue.add(connectionId);
// Process immediately if queue is getting large
@@ -217,9 +224,10 @@
}
const toCleanup = Array.from(this.cleanupQueue).slice(0, this.cleanupBatchSize);
- this.cleanupQueue.clear();
+ // Remove only the items we're processing, not the entire queue!
for (const connectionId of toCleanup) {
+ this.cleanupQueue.delete(connectionId);
const record = this.connectionRecords.get(connectionId);
if (record) {
this.cleanupConnection(record, record.incomingTerminationReason || 'normal');
@@ -456,6 +464,74 @@
}
}
// Also check ALL connections for zombie state (destroyed sockets but not cleaned up)
// This is critical for proxy chains where sockets can be destroyed without events
for (const [connectionId, record] of this.connectionRecords) {
if (!record.connectionClosed) {
const incomingDestroyed = record.incoming?.destroyed || false;
const outgoingDestroyed = record.outgoing?.destroyed || false;
// Check for zombie connections: both sockets destroyed but connection not cleaned up
if (incomingDestroyed && outgoingDestroyed) {
logger.log('warn', `Zombie connection detected: ${connectionId} - both sockets destroyed but not cleaned up`, {
connectionId,
remoteIP: record.remoteIP,
age: plugins.prettyMs(now - record.incomingStartTime),
component: 'connection-manager'
});
// Clean up immediately
this.cleanupConnection(record, 'zombie_cleanup');
continue;
}
// Check for half-zombie: one socket destroyed
if (incomingDestroyed || outgoingDestroyed) {
const age = now - record.incomingStartTime;
// Give it 30 seconds grace period for normal cleanup
if (age > 30000) {
logger.log('warn', `Half-zombie connection detected: ${connectionId} - ${incomingDestroyed ? 'incoming' : 'outgoing'} destroyed`, {
connectionId,
remoteIP: record.remoteIP,
age: plugins.prettyMs(age),
incomingDestroyed,
outgoingDestroyed,
component: 'connection-manager'
});
// Clean up
this.cleanupConnection(record, 'half_zombie_cleanup');
}
}
// Check for stuck connections: no data sent back to client
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
const age = now - record.incomingStartTime;
// If connection is older than 60 seconds and no data sent back, likely stuck
if (age > 60000) {
logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
connectionId,
remoteIP: record.remoteIP,
age: plugins.prettyMs(age),
bytesReceived: record.bytesReceived,
targetHost: record.targetHost,
targetPort: record.targetPort,
component: 'connection-manager'
});
// Set termination reason and increment stats
if (record.incomingTerminationReason == null) {
record.incomingTerminationReason = 'stuck_no_response';
this.incrementTerminationStat('incoming', 'stuck_no_response');
}
// Clean up
this.cleanupConnection(record, 'stuck_no_response');
}
}
}
}
// Process only connections that need checking
for (const connectionId of connectionsToCheck) {
const record = this.connectionRecords.get(connectionId);