fix(connection): filter zombie connections part 2

This commit is contained in:
Juergen Kunz
2025-06-07 20:37:49 +00:00
parent 19590ef107
commit 890e907664
4 changed files with 498 additions and 1 deletions

View File

@ -548,4 +548,129 @@ Debug scripts confirmed:
- The zombie detection successfully identifies and cleans up these connections
- Both full zombies (both sockets destroyed) and half-zombies (one socket destroyed) are handled
This fix addresses the specific issue where "connections that are closed on the inner proxy, always also close on the outer proxy" as requested by the user.
This fix addresses the specific issue where "connections that are closed on the inner proxy, always also close on the outer proxy" as requested by the user.
## 🔍 Production Diagnostics (January 2025)
Since the zombie detection fix didn't fully resolve the issue, use the ProductionConnectionMonitor to diagnose the actual problem:
### How to Use the Production Monitor
1. **Add to your proxy startup script**:
```typescript
import ProductionConnectionMonitor from './production-connection-monitor.js';
// After proxy.start()
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds
// Monitor will automatically capture diagnostics when:
// - Connections exceed threshold (default: 50)
// - Sudden spike occurs (default: +20 connections)
```
2. **Diagnostics are saved to**: `.nogit/connection-diagnostics/`
3. **Force capture anytime**: `monitor.forceCaptureNow()`
### What the Monitor Captures
For each connection:
- Socket states (destroyed, readable, writable, readyState)
- Connection flags (closed, keepAlive, TLS status)
- Data transfer statistics
- Time since last activity
- Cleanup queue status
- Event listener counts
- Termination reasons
### Pattern Analysis
The monitor automatically identifies:
- **Zombie connections**: Both sockets destroyed but not cleaned up
- **Half-zombies**: One socket destroyed
- **Stuck connecting**: Outgoing socket stuck in connecting state
- **No outgoing**: Missing outgoing socket
- **Keep-alive stuck**: Keep-alive connections with no recent activity
- **Old connections**: Connections older than 1 hour
- **No data transfer**: Connections with no bytes transferred
- **Listener leaks**: Excessive event listeners
### Common Accumulation Patterns
1. **Connecting State Stuck**
- Outgoing socket shows `connecting: true` indefinitely
- Usually means connection timeout not working
- Check if backend is reachable
2. **Missing Outgoing Socket**
- Connection has no outgoing socket but isn't closed
- May indicate immediate routing issues
- Check error logs during connection setup
3. **Event Listener Accumulation**
- High listener counts (>20) on sockets
- Indicates cleanup not removing all listeners
- Can cause memory leaks
4. **Keep-Alive Zombies**
- Keep-alive connections not timing out
- Check keepAlive timeout settings
- May need more aggressive cleanup
### Next Steps
1. **Run the monitor in production** during accumulation
2. **Share the diagnostic files** from `.nogit/connection-diagnostics/`
3. **Look for patterns** in the captured snapshots
4. **Check specific connection IDs** that accumulate
The diagnostic files will show exactly what state connections are in when accumulation occurs, allowing targeted fixes for the specific issue.
## ✅ FIXED: Stuck Connection Detection (January 2025)
### Additional Root Cause Found
Connections to hanging backends (that accept but never respond) were not being cleaned up because:
- Both sockets remain alive (not destroyed)
- Keep-alive prevents normal timeout
- No data is sent back to the client despite receiving data
- These don't qualify as "zombies" since sockets aren't destroyed
### Fix Implemented
Added stuck connection detection to the periodic inactivity check:
```typescript
// Check for stuck connections: no data sent back to client
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
const age = now - record.incomingStartTime;
// If connection is older than 60 seconds and no data sent back, likely stuck
if (age > 60000) {
logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
connectionId,
remoteIP: record.remoteIP,
age: plugins.prettyMs(age),
bytesReceived: record.bytesReceived,
targetHost: record.targetHost,
targetPort: record.targetPort,
component: 'connection-manager'
});
// Clean up
this.cleanupConnection(record, 'stuck_no_response');
}
}
```
### What This Fixes
- Connections to backends that accept but never respond
- Proxy chains where inner proxy connects to unresponsive services
- Scenarios where keep-alive prevents normal timeout mechanisms
- Connections that receive client data but never send anything back
### Detection Criteria
- Connection has received bytes from client (`bytesReceived > 0`)
- No bytes sent back to client (`bytesSent === 0`)
- Connection is older than 60 seconds
- Both sockets are still alive (not destroyed)
This complements the zombie detection by handling cases where sockets remain technically alive but the connection is effectively dead.

202
readme.monitoring.md Normal file
View File

@ -0,0 +1,202 @@
# Production Connection Monitoring
This document explains how to use the ProductionConnectionMonitor to diagnose connection accumulation issues in real-time.
## Quick Start
```typescript
import ProductionConnectionMonitor from './.nogit/debug/production-connection-monitor.js';
// After starting your proxy
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds
// The monitor will automatically capture diagnostics when:
// - Connections exceed 50 (default threshold)
// - Sudden spike of 20+ connections occurs
// - You manually call monitor.forceCaptureNow()
```
## What Gets Captured
When accumulation is detected, the monitor saves a JSON file with:
### Connection Details
- Socket states (destroyed, readable, writable, readyState)
- Connection age and activity timestamps
- Data transfer statistics (bytes sent/received)
- Target host and port information
- Keep-alive status
- Event listener counts
### System State
- Memory usage
- Event loop lag
- Connection count trends
- Termination statistics
## Reading Diagnostic Files
Files are saved to `.nogit/connection-diagnostics/` with names like:
```
accumulation_2025-06-07T20-20-43-733Z_force_capture.json
```
### Key Fields to Check
1. **Socket States**
```json
"incomingState": {
"destroyed": false,
"readable": true,
"writable": true,
"readyState": "open"
}
```
- Both destroyed = zombie connection
- One destroyed = half-zombie
- Both alive but old = potential stuck connection
2. **Data Transfer**
```json
"bytesReceived": 36,
"bytesSent": 0,
"timeSinceLastActivity": 60000
```
- No bytes sent back = stuck connection
- High bytes but old = slow backend
- No activity = idle connection
3. **Connection Flags**
```json
"hasReceivedInitialData": false,
"hasKeepAlive": true,
"connectionClosed": false
```
- hasReceivedInitialData=false on non-TLS = immediate routing
- hasKeepAlive=true = extended timeout applies
- connectionClosed=false = still tracked
## Common Patterns
### 1. Hanging Backend Pattern
```json
{
"bytesReceived": 36,
"bytesSent": 0,
"age": 120000,
"targetHost": "backend.example.com",
"incomingState": { "destroyed": false },
"outgoingState": { "destroyed": false }
}
```
**Fix**: The stuck connection detection (60s timeout) should clean these up.
### 2. Zombie Connection Pattern
```json
{
"incomingState": { "destroyed": true },
"outgoingState": { "destroyed": true },
"connectionClosed": false
}
```
**Fix**: The zombie detection should clean these up within 30s.
### 3. Event Listener Leak Pattern
```json
{
"incomingListeners": {
"data": 15,
"error": 20,
"close": 18
}
}
```
**Issue**: Event listeners accumulating, potential memory leak.
### 4. No Outgoing Socket Pattern
```json
{
"outgoingState": { "exists": false },
"connectionClosed": false,
"age": 5000
}
```
**Issue**: Connection setup failed but cleanup didn't trigger.
## Forcing Diagnostic Capture
To capture current state immediately:
```typescript
monitor.forceCaptureNow();
```
This is useful when you notice accumulation starting.
## Automated Analysis
The monitor automatically analyzes patterns and logs:
- Zombie/half-zombie counts
- Stuck connection counts
- Old connection counts
- Memory usage
- Recommendations
## Integration Example
```typescript
// In your proxy startup script
import { SmartProxy } from '@push.rocks/smartproxy';
import ProductionConnectionMonitor from './production-connection-monitor.js';
async function startProxyWithMonitoring() {
const proxy = new SmartProxy({
// your config
});
await proxy.start();
// Start monitoring
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000);
// Optional: Capture on specific events
process.on('SIGUSR1', () => {
console.log('Manual diagnostic capture triggered');
monitor.forceCaptureNow();
});
// Graceful shutdown
process.on('SIGTERM', async () => {
monitor.stop();
await proxy.stop();
process.exit(0);
});
}
```
## Troubleshooting
### Monitor Not Detecting Accumulation
- Check threshold settings (default: 50 connections)
- Reduce check interval for faster detection
- Use forceCaptureNow() to capture current state
### Too Many False Positives
- Increase accumulation threshold
- Increase spike threshold
- Adjust check interval
### Missing Diagnostic Data
- Ensure output directory exists and is writable
- Check disk space
- Verify process has write permissions
## Next Steps
1. Deploy the monitor to production
2. Wait for accumulation to occur
3. Share diagnostic files for analysis
4. Apply targeted fixes based on patterns found
The diagnostic data will reveal the exact state of connections when accumulation occurs, enabling precise fixes for your specific scenario.

View File

@ -0,0 +1,144 @@
import { expect, tap } from '@git.zone/tstest/tapbundle';
import * as net from 'net';
import { SmartProxy } from '../ts/index.js';
import * as plugins from '../ts/plugins.js';
tap.test('stuck connection cleanup - verify connections to hanging backends are cleaned up', async (tools) => {
console.log('\n=== Stuck Connection Cleanup Test ===');
console.log('Purpose: Verify that connections to backends that accept but never respond are cleaned up');
// Create a hanging backend that accepts connections but never responds
let backendConnections = 0;
const hangingBackend = net.createServer((socket) => {
backendConnections++;
console.log(`Hanging backend: Connection ${backendConnections} received`);
// Accept the connection but never send any data back
// This simulates a hung backend service
});
await new Promise<void>((resolve) => {
hangingBackend.listen(9997, () => {
console.log('✓ Hanging backend started on port 9997');
resolve();
});
});
// Create proxy that forwards to hanging backend
const proxy = new SmartProxy({
routes: [{
name: 'to-hanging-backend',
match: { ports: 8589 },
action: {
type: 'forward',
target: { host: 'localhost', port: 9997 }
}
}],
keepAlive: true,
enableDetailedLogging: false,
inactivityTimeout: 5000, // 5 second inactivity check interval for faster testing
});
await proxy.start();
console.log('✓ Proxy started on port 8589');
// Create connections that will get stuck
console.log('\n--- Creating connections to hanging backend ---');
const clients: net.Socket[] = [];
for (let i = 0; i < 5; i++) {
const client = net.connect(8589, 'localhost');
clients.push(client);
await new Promise<void>((resolve) => {
client.on('connect', () => {
console.log(`Client ${i} connected`);
// Send data that will never get a response
client.write(`GET / HTTP/1.1\r\nHost: localhost\r\n\r\n`);
resolve();
});
client.on('error', (err) => {
console.log(`Client ${i} error: ${err.message}`);
resolve();
});
});
}
// Wait a moment for connections to establish
await plugins.smartdelay.delayFor(1000);
// Check initial connection count
const initialCount = (proxy as any).connectionManager.getConnectionCount();
console.log(`\nInitial connection count: ${initialCount}`);
expect(initialCount).toEqual(5);
// Get connection details
const connections = (proxy as any).connectionManager.getConnections();
let stuckCount = 0;
for (const [id, record] of connections) {
if (record.bytesReceived > 0 && record.bytesSent === 0) {
stuckCount++;
console.log(`Stuck connection ${id}: received=${record.bytesReceived}, sent=${record.bytesSent}`);
}
}
console.log(`Stuck connections found: ${stuckCount}`);
expect(stuckCount).toEqual(5);
// Wait for inactivity check to run (it checks every 30s by default, but we set it to 5s)
console.log('\n--- Waiting for stuck connection detection (65 seconds) ---');
console.log('Note: Stuck connections are cleaned up after 60 seconds with no response');
// Speed up time by manually triggering inactivity check after simulating time passage
// First, age the connections by updating their timestamps
const now = Date.now();
for (const [id, record] of connections) {
// Simulate that these connections are 61 seconds old
record.incomingStartTime = now - 61000;
record.lastActivity = now - 61000;
}
// Manually trigger inactivity check
console.log('Manually triggering inactivity check...');
(proxy as any).connectionManager.performOptimizedInactivityCheck();
// Wait for cleanup to complete
await plugins.smartdelay.delayFor(1000);
// Check connection count after cleanup
const afterCleanupCount = (proxy as any).connectionManager.getConnectionCount();
console.log(`\nConnection count after cleanup: ${afterCleanupCount}`);
// Verify termination stats
const stats = (proxy as any).connectionManager.getTerminationStats();
console.log('\nTermination stats:', stats);
// All connections should be cleaned up as "stuck_no_response"
expect(afterCleanupCount).toEqual(0);
// The termination reason might be under incoming or general stats
const stuckCleanups = (stats.incoming.stuck_no_response || 0) +
(stats.outgoing?.stuck_no_response || 0);
console.log(`Stuck cleanups detected: ${stuckCleanups}`);
expect(stuckCleanups).toBeGreaterThan(0);
// Verify clients were disconnected
let closedClients = 0;
for (const client of clients) {
if (client.destroyed) {
closedClients++;
}
}
console.log(`Closed clients: ${closedClients}/5`);
expect(closedClients).toEqual(5);
// Cleanup
console.log('\n--- Cleanup ---');
await proxy.stop();
hangingBackend.close();
console.log('✓ Test complete: Stuck connections are properly detected and cleaned up');
});
tap.start();

View File

@ -495,6 +495,32 @@ export class ConnectionManager extends LifecycleComponent {
this.cleanupConnection(record, 'half_zombie_cleanup');
}
}
// Check for stuck connections: no data sent back to client
if (!record.connectionClosed && record.outgoing && record.bytesReceived > 0 && record.bytesSent === 0) {
const age = now - record.incomingStartTime;
// If connection is older than 60 seconds and no data sent back, likely stuck
if (age > 60000) {
logger.log('warn', `Stuck connection detected: ${connectionId} - received ${record.bytesReceived} bytes but sent 0 bytes`, {
connectionId,
remoteIP: record.remoteIP,
age: plugins.prettyMs(age),
bytesReceived: record.bytesReceived,
targetHost: record.targetHost,
targetPort: record.targetPort,
component: 'connection-manager'
});
// Set termination reason and increment stats
if (record.incomingTerminationReason == null) {
record.incomingTerminationReason = 'stuck_no_response';
this.incrementTerminationStat('incoming', 'stuck_no_response');
}
// Clean up
this.cleanupConnection(record, 'stuck_no_response');
}
}
}
}