Juergen Kunz 890e907664 fix(connection): filter zombie connections part 2

2025-06-07 20:37:49 +00:00

4.8 KiB

Production Connection Monitoring

This document explains how to use the ProductionConnectionMonitor to diagnose connection accumulation issues in real time.

Quick Start

import ProductionConnectionMonitor from './.nogit/debug/production-connection-monitor.js';

// After starting your proxy
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds

// The monitor will automatically capture diagnostics when:
// - Connections exceed 50 (default threshold)
// - Sudden spike of 20+ connections occurs
// - You manually call monitor.forceCaptureNow()
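The two automatic triggers listed in the comments above can be pictured as a comparison between consecutive samples. The following is only a sketch of that idea, not the monitor's actual implementation: the function name `shouldCapture`, the option names `accumulationThreshold` and `spikeThreshold`, and the reason strings are assumptions. Only the default values 50 and 20 come from this document.

```javascript
// Hypothetical sketch of the two automatic triggers; not the monitor's real code.
// Defaults mirror the documented values: more than 50 connections, or a jump of 20+.
function shouldCapture(previousCount, currentCount, options = {}) {
  const { accumulationThreshold = 50, spikeThreshold = 20 } = options;

  if (currentCount > accumulationThreshold) {
    return { capture: true, reason: 'threshold_exceeded' };
  }
  if (currentCount - previousCount >= spikeThreshold) {
    return { capture: true, reason: 'sudden_spike' };
  }
  return { capture: false, reason: null };
}

console.log(shouldCapture(10, 55)); // count is above the threshold
console.log(shouldCapture(10, 35)); // jump of 25 connections between samples
console.log(shouldCapture(10, 15)); // neither trigger fires
```

Because the spike check compares against the previous sample, a shorter check interval makes each sample-to-sample jump smaller, so the interval and the spike threshold are best tuned together.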

What Gets Captured

When accumulation is detected, the monitor saves a JSON file with:

Connection Details

- Socket states (destroyed, readable, writable, readyState)
- Connection age and activity timestamps
- Data transfer statistics (bytes sent/received)
- Target host and port information
- Keep-alive status
- Event listener counts

System State

- Memory usage
- Event loop lag
- Connection count trends
- Termination statistics

Reading Diagnostic Files

Files are saved to .nogit/connection-diagnostics/ with names like:

accumulation_2025-06-07T20-20-43-733Z_force_capture.json
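When sorting or filtering many diagnostic files, it can help to recover the capture time and trigger reason from the file name. The sketch below assumes the layout `accumulation_<timestamp>_<reason>.json`, with the timestamp being an ISO 8601 string whose `:` and `.` were replaced by `-`. That layout is inferred from the single example above, and `parseDiagnosticFileName` is a hypothetical helper, not part of the monitor.

```javascript
// Hypothetical helper: split a diagnostic file name into capture time and trigger reason.
// The expected layout is an assumption inferred from one example file name.
const FILE_NAME_PATTERN =
  /^accumulation_(\d{4}-\d{2}-\d{2})T(\d{2})-(\d{2})-(\d{2})-(\d{3})Z_(.+)\.json$/;

function parseDiagnosticFileName(fileName) {
  const match = FILE_NAME_PATTERN.exec(fileName);
  if (!match) return null; // not a diagnostic file in the assumed format

  const [, date, hours, minutes, seconds, millis, reason] = match;
  return {
    // Rebuild a standard ISO 8601 string so Date can parse it
    capturedAt: new Date(`${date}T${hours}:${minutes}:${seconds}.${millis}Z`),
    reason,
  };
}

const parsed = parseDiagnosticFileName(
  'accumulation_2025-06-07T20-20-43-733Z_force_capture.json'
);
console.log(parsed.capturedAt.toISOString(), parsed.reason);
// prints "2025-06-07T20:20:43.733Z force_capture"
```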

Key Fields to Check

Socket States

"incomingState": {
  "destroyed": false,
  "readable": true,
  "writable": true,
  "readyState": "open"
}

- Both destroyed = zombie connection
- One destroyed = half-zombie
- Both alive but old = potential stuck connection
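The three rules above translate directly into a small classifier that can be run over the records in a diagnostic file. This is a sketch under assumptions: `classifySockets` is a hypothetical helper, the record is assumed to carry `incomingState`, `outgoingState`, and `age` (in milliseconds) as in the snippets in this document, and the 60-second cutoff for "old" is an assumed value.

```javascript
// Hypothetical classifier for one connection record from a diagnostic file.
// OLD_CONNECTION_MS is an assumed cutoff for "old"; the document gives no exact value.
const OLD_CONNECTION_MS = 60_000;

function classifySockets(record) {
  const incomingDestroyed = record.incomingState?.destroyed === true;
  const outgoingDestroyed = record.outgoingState?.destroyed === true;

  if (incomingDestroyed && outgoingDestroyed) return 'zombie';
  if (incomingDestroyed || outgoingDestroyed) return 'half-zombie';
  if ((record.age ?? 0) > OLD_CONNECTION_MS) return 'potentially-stuck';
  return 'healthy';
}

console.log(classifySockets({
  incomingState: { destroyed: true },
  outgoingState: { destroyed: false },
  age: 5000,
})); // prints "half-zombie"
```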

Data Transfer

```
"bytesReceived": 36,
"bytesSent": 0,
"timeSinceLastActivity": 60000
```

- No bytes sent back = stuck connection
- High bytes but old = slow backend
- No activity = idle connection

Connection Flags

```
"hasReceivedInitialData": false,
"hasKeepAlive": true,
"connectionClosed": false
```

- hasReceivedInitialData=false on non-TLS = immediate routing
- hasKeepAlive=true = extended timeout applies
- connectionClosed=false = still tracked
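The data transfer and flag rules above can be combined into a helper that annotates a record with plain-language notes. This is a rough sketch, not the monitor's analysis code: `describeConnection`, `idleAfterMs`, and `highByteCount` are hypothetical names, both cutoffs are assumed values, and "old" is interpreted here as "no recent activity".

```javascript
// Hypothetical interpreter for the data transfer and flag fields of one record.
// idleAfterMs and highByteCount are assumed cutoffs; the document does not define them.
function describeConnection(record, options = {}) {
  const { idleAfterMs = 60_000, highByteCount = 1_000_000 } = options;
  const notes = [];

  const received = record.bytesReceived ?? 0;
  const sent = record.bytesSent ?? 0;
  const idle = (record.timeSinceLastActivity ?? 0) >= idleAfterMs;

  if (received > 0 && sent === 0) {
    notes.push('stuck: data received but nothing sent back');
  }
  if (received + sent >= highByteCount && idle) {
    notes.push('slow backend: high volume but no recent activity');
  }
  if (idle) notes.push('idle: no recent activity');
  if (record.hasKeepAlive === true) notes.push('keep-alive: extended timeout applies');
  if (record.connectionClosed === false) notes.push('still tracked by the proxy');

  return notes;
}

console.log(describeConnection({
  bytesReceived: 36,
  bytesSent: 0,
  timeSinceLastActivity: 60000,
  hasKeepAlive: true,
  connectionClosed: false,
}));
```

A record can collect several notes at once, which is often the informative case: a connection that is both "stuck" and "keep-alive" will linger longer before any timeout removes it.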

Common Patterns

1. Hanging Backend Pattern

{
  "bytesReceived": 36,
  "bytesSent": 0,
  "age": 120000,
  "targetHost": "backend.example.com",
  "incomingState": { "destroyed": false },
  "outgoingState": { "destroyed": false }
}

Fix: The stuck connection detection (60s timeout) should clean these up.

2. Zombie Connection Pattern

{
  "incomingState": { "destroyed": true },
  "outgoingState": { "destroyed": true },
  "connectionClosed": false
}

Fix: The zombie detection should clean these up within 30s.

3. Event Listener Leak Pattern

{
  "incomingListeners": {
    "data": 15,
    "error": 20,
    "close": 18
  }
}

Issue: Event listeners are accumulating on the socket, which points to a potential memory leak.

4. No Outgoing Socket Pattern

{
  "outgoingState": { "exists": false },
  "connectionClosed": false,
  "age": 5000
}

Issue: Connection setup failed but cleanup didn't trigger.
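To scan a diagnostic file for these four patterns without reading every record by hand, the checks can be expressed as one function. The sketch below is hypothetical: `matchPatterns` and the pattern labels are illustrative names, the listener cutoff of 10 is an assumed value, and only the 60-second stuck timeout is taken from this document. Field names follow the JSON snippets above.

```javascript
// Hypothetical matcher for the four patterns described above.
// LISTENER_LEAK_THRESHOLD is an assumption; STUCK_AFTER_MS reflects the documented 60s timeout.
const STUCK_AFTER_MS = 60_000;
const LISTENER_LEAK_THRESHOLD = 10;

function matchPatterns(record) {
  const matches = [];
  const incoming = record.incomingState ?? {};
  const outgoing = record.outgoingState ?? {};
  const stillTracked = record.connectionClosed === false;

  // 1. Hanging backend: request bytes arrived, nothing came back, both sockets alive
  if (
    record.bytesReceived > 0 &&
    record.bytesSent === 0 &&
    record.age > STUCK_AFTER_MS &&
    !incoming.destroyed &&
    !outgoing.destroyed
  ) {
    matches.push('hanging-backend');
  }

  // 2. Zombie: both sockets destroyed but the record was never cleaned up
  if (stillTracked && incoming.destroyed && outgoing.destroyed) {
    matches.push('zombie');
  }

  // 3. Listener leak: any event has more listeners than the assumed cutoff
  const listenerCounts = Object.values(record.incomingListeners ?? {});
  if (listenerCounts.some((count) => count > LISTENER_LEAK_THRESHOLD)) {
    matches.push('listener-leak');
  }

  // 4. No outgoing socket: setup failed but the record is still tracked
  if (stillTracked && outgoing.exists === false) {
    matches.push('no-outgoing-socket');
  }

  return matches;
}

console.log(matchPatterns({
  incomingState: { destroyed: true },
  outgoingState: { destroyed: true },
  connectionClosed: false,
})); // prints [ 'zombie' ]
```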

Forcing Diagnostic Capture

To capture current state immediately:

monitor.forceCaptureNow();

This is useful when you notice accumulation starting.

Automated Analysis

The monitor automatically analyzes patterns and logs:

- Zombie/half-zombie counts
- Stuck connection counts
- Old connection counts
- Memory usage
- Recommendations

Integration Example

// In your proxy startup script
import { SmartProxy } from '@push.rocks/smartproxy';
import ProductionConnectionMonitor from './production-connection-monitor.js';

async function startProxyWithMonitoring() {
  const proxy = new SmartProxy({
    // your config
  });
  
  await proxy.start();
  
  // Start monitoring
  const monitor = new ProductionConnectionMonitor(proxy);
  monitor.start(5000);
  
  // Optional: Capture on specific events
  // Note: Node.js reserves SIGUSR1 for starting the debugger; use SIGUSR2 if the two conflict
  process.on('SIGUSR1', () => {
    console.log('Manual diagnostic capture triggered');
    monitor.forceCaptureNow();
  });
  
  // Graceful shutdown
  process.on('SIGTERM', async () => {
    monitor.stop();
    await proxy.stop();
    process.exit(0);
  });
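}

// Start the proxy and surface startup failures instead of leaving an unhandled rejection
startProxyWithMonitoring().catch((error) => {
  console.error('Failed to start proxy with monitoring:', error);
  process.exit(1);
});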

Troubleshooting

Monitor Not Detecting Accumulation

- Check threshold settings (default: 50 connections)
- Reduce check interval for faster detection
- Use forceCaptureNow() to capture current state

Too Many False Positives

- Increase accumulation threshold
- Increase spike threshold
- Adjust check interval

Missing Diagnostic Data

- Ensure output directory exists and is writable
- Check disk space
- Verify process has write permissions
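The directory and permission checks above can be run as a preflight step before starting the monitor. The following is a minimal sketch using Node's built-in `fs` module: `ensureWritableDirectory` is a hypothetical helper, not part of ProductionConnectionMonitor, and it does not check free disk space.

```javascript
import { accessSync, constants, mkdirSync } from 'node:fs';
import { join } from 'node:path';

// Hypothetical preflight check; not part of ProductionConnectionMonitor.
// Creates the directory if needed and reports whether this process can write to it.
function ensureWritableDirectory(directory) {
  try {
    mkdirSync(directory, { recursive: true }); // no error if it already exists
    accessSync(directory, constants.W_OK); // throws if the process cannot write here
    return { ok: true, error: null };
  } catch (error) {
    return { ok: false, error: error.message };
  }
}

// Path taken from this document, resolved relative to the working directory
const preflight = ensureWritableDirectory(join('.nogit', 'connection-diagnostics'));
console.log(
  preflight.ok
    ? 'diagnostics directory is writable'
    : `cannot write diagnostics: ${preflight.error}`
);
```

Returning a result object instead of throwing lets the startup script decide whether a missing diagnostics directory should block the proxy or only produce a warning.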

Next Steps

1. Deploy the monitor to production
2. Wait for accumulation to occur
3. Share diagnostic files for analysis
4. Apply targeted fixes based on patterns found

The diagnostic data will reveal the exact state of connections when accumulation occurs, enabling precise fixes for your specific scenario.
