smartproxy/readme.monitoring.md
2025-06-07 20:37:49 +00:00

Production Connection Monitoring

This document explains how to use the ProductionConnectionMonitor to diagnose connection accumulation in real time.

Quick Start

import ProductionConnectionMonitor from './.nogit/debug/production-connection-monitor.js';

// After starting your proxy
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds

// The monitor will automatically capture diagnostics when:
// - Connections exceed 50 (default threshold)
// - Sudden spike of 20+ connections occurs
// - You manually call monitor.forceCaptureNow()
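
The two automatic triggers listed above can be sketched as a small decision function. This is an illustration of the documented rules only, not the monitor's actual implementation; the function name, the option names, and the returned reason strings are assumptions made for this example.

```javascript
// Hypothetical helper: mirrors the documented defaults (50 connections, spike of 20+).
// It is NOT taken from ProductionConnectionMonitor itself.
function captureReason(currentCount, previousCount, options = {}) {
  const { threshold = 50, spike = 20 } = options;

  // Rule 1: total connections exceed the threshold.
  if (currentCount > threshold) return 'threshold_exceeded';

  // Rule 2: the count grew by at least `spike` since the previous check.
  if (currentCount - previousCount >= spike) return 'sudden_spike';

  // Neither rule fired, so no capture is needed.
  return null;
}
```

Raising either value makes the check less sensitive, which is the trade-off discussed under Troubleshooting.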

What Gets Captured

When accumulation is detected, the monitor saves a JSON file with:

Connection Details

  • Socket states (destroyed, readable, writable, readyState)
  • Connection age and activity timestamps
  • Data transfer statistics (bytes sent/received)
  • Target host and port information
  • Keep-alive status
  • Event listener counts

System State

  • Memory usage
  • Event loop lag
  • Connection count trends
  • Termination statistics
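
For orientation, memory usage and event loop lag can be sampled in plain Node.js roughly as follows. This is a minimal sketch and may differ from what the monitor does internally; both function names are hypothetical, and only process.memoryUsage() and setTimeout are standard Node.js APIs.

```javascript
// Hypothetical helper: reports selected process.memoryUsage() fields in MiB.
function memorySnapshotMb() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const toMb = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  return { rssMb: toMb(rss), heapUsedMb: toMb(heapUsed), heapTotalMb: toMb(heapTotal) };
}

// Hypothetical helper: schedules a timer and resolves with how late it fired.
// A busy event loop delays the callback, so the overshoot approximates the lag.
function measureEventLoopLagMs(delayMs = 100) {
  return new Promise((resolve) => {
    const start = Date.now();
    setTimeout(() => {
      resolve(Math.max(0, Date.now() - start - delayMs));
    }, delayMs);
  });
}
```

Calling `measureEventLoopLagMs().then((lag) => console.log(lag))` alongside `memorySnapshotMb()` gives a rough picture of process health at the moment of a capture.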

Reading Diagnostic Files

Files are saved to .nogit/connection-diagnostics/ with names like:

accumulation_2025-06-07T20-20-43-733Z_force_capture.json
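
The name appears to be an ISO 8601 timestamp with ":" and "." replaced by "-", followed by the capture reason. The sketch below builds and parses names on that assumption; the pattern is inferred from the single example above, and both function names are hypothetical.

```javascript
// Hypothetical helper: builds a name in the same shape as the example file.
function diagnosticFileName(date, reason) {
  // toISOString() yields e.g. 2025-06-07T20:20:43.733Z; ':' and '.' become '-'.
  const stamp = date.toISOString().replace(/[:.]/g, '-');
  return `accumulation_${stamp}_${reason}.json`;
}

// Hypothetical helper: recovers the capture time and reason from a file name.
// Returns null when the name does not follow the assumed pattern.
function parseDiagnosticFileName(name) {
  const pattern = /^accumulation_(\d{4}-\d{2}-\d{2})T(\d{2})-(\d{2})-(\d{2})-(\d{3})Z_(.+)\.json$/;
  const match = pattern.exec(name);
  if (!match) return null;
  const [, day, hh, mm, ss, ms, reason] = match;
  return { capturedAt: new Date(`${day}T${hh}:${mm}:${ss}.${ms}Z`), reason };
}
```

Parsing the timestamp makes it easy to sort captures chronologically or to pick the files around a known incident time.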

Key Fields to Check

  1. Socket States

    "incomingState": {
      "destroyed": false,
      "readable": true,
      "writable": true,
      "readyState": "open"
    }
    
    • incomingState and outgoingState both destroyed = zombie connection
    • Only one of the two destroyed = half-zombie
    • Both alive but old = potentially stuck connection
  2. Data Transfer

    "bytesReceived": 36,
    "bytesSent": 0,
    "timeSinceLastActivity": 60000
    
    • bytesReceived > 0 but bytesSent = 0 = stuck connection (nothing was sent back)
    • High byte counts on an old connection = slow backend
    • Large timeSinceLastActivity = idle connection
  3. Connection Flags

    "hasReceivedInitialData": false,
    "hasKeepAlive": true,
    "connectionClosed": false
    
    • hasReceivedInitialData=false on non-TLS = immediate routing
    • hasKeepAlive=true = extended timeout applies
    • connectionClosed=false = still tracked
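
The rules of thumb above can be applied to a single connection record with a short script. This is a sketch under stated assumptions: the field names are taken from the snippets in this section, while the function name, the returned labels, and the 60-second inactivity cutoff are choices made for this example.

```javascript
// Hypothetical helper: turns one connection record into a list of findings.
function describeConnection(record, stuckAfterMs = 60000) {
  const findings = [];
  const inDead = Boolean(record.incomingState && record.incomingState.destroyed);
  const outDead = Boolean(record.outgoingState && record.outgoingState.destroyed);

  // 1. Socket states
  if (inDead && outDead) findings.push('zombie');
  else if (inDead || outDead) findings.push('half-zombie');

  // 2. Data transfer: only meaningful while both sockets are still alive.
  const quietForMs = record.timeSinceLastActivity ?? 0;
  if (!inDead && !outDead && quietForMs >= stuckAfterMs) {
    const nothingSentBack = record.bytesReceived > 0 && record.bytesSent === 0;
    findings.push(nothingSentBack ? 'stuck' : 'idle');
  }

  // 3. Connection flags
  if (record.hasKeepAlive) findings.push('extended-timeout');
  if (record.connectionClosed === false) findings.push('still-tracked');

  return findings;
}
```

Running it over every record in a diagnostic file and counting the labels gives a quick first impression before reading individual entries.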

Common Patterns

1. Hanging Backend Pattern

{
  "bytesReceived": 36,
  "bytesSent": 0,
  "age": 120000,
  "targetHost": "backend.example.com",
  "incomingState": { "destroyed": false },
  "outgoingState": { "destroyed": false }
}

Fix: The stuck connection detection (60s timeout) should clean these up.

2. Zombie Connection Pattern

{
  "incomingState": { "destroyed": true },
  "outgoingState": { "destroyed": true },
  "connectionClosed": false
}

Fix: The zombie detection should clean these up within 30s.

3. Event Listener Leak Pattern

{
  "incomingListeners": {
    "data": 15,
    "error": 20,
    "close": 18
  }
}

Issue: Event listeners accumulating, potential memory leak.
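
To check a live socket for the same symptom, the listener counts can be read with the standard EventEmitter method listenerCount(). The sketch below is only an illustration; the function name and the limit of 10 are assumptions. The limit was chosen because Node.js by default emits a MaxListenersExceededWarning once more than 10 listeners are registered for one event.

```javascript
// Hypothetical helper: reports events whose listener count exceeds `limit`.
// `emitter` is any EventEmitter, for example a net.Socket.
function findListenerLeaks(emitter, eventNames = ['data', 'error', 'close'], limit = 10) {
  const suspicious = {};
  for (const name of eventNames) {
    const count = emitter.listenerCount(name);
    if (count > limit) suspicious[name] = count;
  }
  return suspicious;
}
```

An empty result means none of the checked events is over the limit; counts that keep growing between captures are the stronger signal of a leak.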

4. No Outgoing Socket Pattern

{
  "outgoingState": { "exists": false },
  "connectionClosed": false,
  "age": 5000
}

Issue: Connection setup failed but cleanup didn't trigger.

Forcing Diagnostic Capture

To capture current state immediately:

monitor.forceCaptureNow();

This is useful when you notice accumulation starting.

Automated Analysis

The monitor automatically analyzes patterns and logs:

  • Zombie/half-zombie counts
  • Stuck connection counts
  • Old connection counts
  • Memory usage
  • Recommendations

Integration Example

// In your proxy startup script
import { SmartProxy } from '@push.rocks/smartproxy';
// Adjust this path to where the monitor file lives (the Quick Start uses ./.nogit/debug/)
import ProductionConnectionMonitor from './production-connection-monitor.js';

async function startProxyWithMonitoring() {
  const proxy = new SmartProxy({
    // your config
  });
  
  await proxy.start();
  
  // Start monitoring
  const monitor = new ProductionConnectionMonitor(proxy);
  monitor.start(5000);
  
  // Optional: Capture on specific events
  process.on('SIGUSR1', () => {
    console.log('Manual diagnostic capture triggered');
    monitor.forceCaptureNow();
  });
  
  // Graceful shutdown
  process.on('SIGTERM', async () => {
    monitor.stop();
    await proxy.stop();
    process.exit(0);
  });
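}

// Run the startup function and fail loudly if the proxy cannot start
startProxyWithMonitoring().catch((err) => {
  console.error('Failed to start proxy with monitoring:', err);
  process.exit(1);
});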

Troubleshooting

Monitor Not Detecting Accumulation

  • Check threshold settings (default: 50 connections)
  • Reduce check interval for faster detection
  • Use forceCaptureNow() to capture current state

Too Many False Positives

  • Increase accumulation threshold
  • Increase spike threshold
  • Adjust check interval

Missing Diagnostic Data

  • Ensure the output directory (.nogit/connection-diagnostics/) exists
  • Check disk space
  • Verify the process has write permission for that directory

Next Steps

  1. Deploy the monitor to production
  2. Wait for accumulation to occur
  3. Share diagnostic files for analysis
  4. Apply targeted fixes based on patterns found

The diagnostic data will reveal the exact state of connections when accumulation occurs, enabling precise fixes for your specific scenario.