smartproxy/readme.monitoring.md

# Production Connection Monitoring

This document explains how to use the `ProductionConnectionMonitor` to diagnose connection accumulation issues in real time.

## Quick Start

```typescript
import ProductionConnectionMonitor from './.nogit/debug/production-connection-monitor.js';

// After starting your proxy
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds

// The monitor will automatically capture diagnostics when:
// - Connections exceed 50 (default threshold)
// - Sudden spike of 20+ connections occurs
// - You manually call monitor.forceCaptureNow()
```
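
The trigger conditions listed in the comments above can be sketched as a small pure function. This is an illustration only, not the monitor's actual code: `shouldCapture`, `CaptureThresholds`, and the reason strings are hypothetical names, and only the default values (50 connections, spike of 20) come from this document.

```typescript
// Hypothetical sketch of the capture triggers described above.
// All names are illustrative; only the default values come from this document.
interface CaptureThresholds {
  maxConnections: number; // capture when the connection count exceeds this value
  spikeSize: number;      // capture when the count grows by at least this much between checks
}

const DEFAULT_THRESHOLDS: CaptureThresholds = { maxConnections: 50, spikeSize: 20 };

type CaptureReason = 'threshold_exceeded' | 'sudden_spike' | null;

function shouldCapture(
  previousCount: number,
  currentCount: number,
  thresholds: CaptureThresholds = DEFAULT_THRESHOLDS,
): CaptureReason {
  if (currentCount > thresholds.maxConnections) return 'threshold_exceeded';
  if (currentCount - previousCount >= thresholds.spikeSize) return 'sudden_spike';
  return null; // nothing unusual, no capture needed
}
```

With a 5-second check interval, `previousCount` would be the count seen at the previous check, so a spike means 20 or more new connections within one interval.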

## What Gets Captured

When accumulation is detected, the monitor saves a JSON file with:

### Connection Details
- Socket states (destroyed, readable, writable, readyState)
- Connection age and activity timestamps
- Data transfer statistics (bytes sent/received)
- Target host and port information
- Keep-alive status
- Event listener counts

### System State
- Memory usage
- Event loop lag
- Connection count trends
- Termination statistics
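
Event loop lag is commonly estimated by scheduling a timer and measuring how late it fires. The sketch below shows that general technique; it is not necessarily how the monitor measures lag, and `computeLag` and `sampleEventLoopLag` are hypothetical names.

```typescript
// Lag = how much later than requested the timer callback actually ran (never negative).
function computeLag(scheduledAt: number, firedAt: number, delayMs: number): number {
  return Math.max(0, firedAt - scheduledAt - delayMs);
}

// Schedules a single timer and reports the measured lag in milliseconds.
function sampleEventLoopLag(delayMs: number, report: (lagMs: number) => void): void {
  const scheduledAt = Date.now();
  setTimeout(() => {
    report(computeLag(scheduledAt, Date.now(), delayMs));
  }, delayMs);
}
```

A consistently high lag value in a diagnostic file suggests the process is busy with synchronous work, which can delay connection cleanup.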

## Reading Diagnostic Files

Files are saved to `.nogit/connection-diagnostics/` with names like:
```
accumulation_2025-06-07T20-20-43-733Z_force_capture.json
```
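
The file name appears to combine an ISO timestamp (with `:` and `.` replaced by `-`) and the capture reason. A minimal sketch that reproduces this scheme, assuming that reading of the example is correct (`diagnosticFileName` is a hypothetical helper, not part of the monitor's API):

```typescript
// Hypothetical helper that reproduces the naming scheme shown above.
function diagnosticFileName(capturedAt: Date, reason: string): string {
  // toISOString() yields e.g. 2025-06-07T20:20:43.733Z; ':' and '.' are replaced
  // with '-' so the timestamp is safe to use in a file name.
  const timestamp = capturedAt.toISOString().replace(/[:.]/g, '-');
  return `accumulation_${timestamp}_${reason}.json`;
}
```

Because the timestamp is in UTC and sorts lexicographically, listing the directory by name also lists captures in chronological order.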

### Key Fields to Check

1. **Socket States**
   ```json
   "incomingState": {
     "destroyed": false,
     "readable": true,
     "writable": true,
     "readyState": "open"
   }
   ```
   - Both `incomingState` and `outgoingState` destroyed = zombie connection
   - Only one of the two destroyed = half-zombie
   - Both alive but old = potentially stuck connection

2. **Data Transfer**
   ```json
   "bytesReceived": 36,
   "bytesSent": 0,
   "timeSinceLastActivity": 60000
   ```
   - No bytes sent back = stuck connection
   - High bytes but old = slow backend
   - No activity = idle connection

3. **Connection Flags**
   ```json
   "hasReceivedInitialData": false,
   "hasKeepAlive": true,
   "connectionClosed": false
   ```
   - `hasReceivedInitialData=false` on non-TLS = immediate routing
   - `hasKeepAlive=true` = extended timeout applies
   - `connectionClosed=false` = still tracked
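
The rules for socket states and data transfer can be combined into one classifier. The field names follow the JSON excerpts above; the type names, the `diagnose` function, and the 60-second inactivity cutoff are assumptions made for this sketch.

```typescript
// Subset of a diagnostic record; field names follow the JSON excerpts above.
interface ConnectionRecord {
  incomingState: { destroyed: boolean };
  outgoingState: { destroyed: boolean };
  bytesReceived: number;
  bytesSent: number;
  timeSinceLastActivity: number; // milliseconds
}

type Diagnosis = 'zombie' | 'half-zombie' | 'stuck' | 'idle' | 'active';

// Hypothetical classifier; inactiveAfterMs is an assumed cutoff, not a monitor setting.
function diagnose(record: ConnectionRecord, inactiveAfterMs = 60_000): Diagnosis {
  const incomingDead = record.incomingState.destroyed;
  const outgoingDead = record.outgoingState.destroyed;

  if (incomingDead && outgoingDead) return 'zombie';
  if (incomingDead || outgoingDead) return 'half-zombie';

  if (record.timeSinceLastActivity >= inactiveAfterMs) {
    // Data arrived but nothing was ever sent back: the backend is likely hanging.
    if (record.bytesReceived > 0 && record.bytesSent === 0) return 'stuck';
    return 'idle';
  }
  return 'active';
}
```

Socket state is checked first because a destroyed socket makes the transfer counters irrelevant: such a connection should no longer be tracked at all.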

## Common Patterns

### 1. Hanging Backend Pattern
```json
{
  "bytesReceived": 36,
  "bytesSent": 0,
  "age": 120000,
  "targetHost": "backend.example.com",
  "incomingState": { "destroyed": false },
  "outgoingState": { "destroyed": false }
}
```
**Fix**: The stuck connection detection (60s timeout) should clean these up.

### 2. Zombie Connection Pattern
```json
{
  "incomingState": { "destroyed": true },
  "outgoingState": { "destroyed": true },
  "connectionClosed": false
}
```
**Fix**: The zombie detection should clean these up within 30s.

### 3. Event Listener Leak Pattern
```json
{
  "incomingListeners": {
    "data": 15,
    "error": 20,
    "close": 18
  }
}
```
**Issue**: Event listeners accumulating, potential memory leak.
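
A quick way to spot this pattern is to flag every event whose listener count exceeds a limit. The sketch below uses 10 as the limit because that is the default at which Node.js emits its own max-listeners warning; `findListenerLeaks` is a hypothetical helper, not part of the monitor.

```typescript
// Returns the event names whose listener count exceeds the limit.
// The default of 10 mirrors Node's EventEmitter.defaultMaxListeners.
function findListenerLeaks(counts: Record<string, number>, limit = 10): string[] {
  return Object.entries(counts)
    .filter(([, count]) => count > limit)
    .map(([eventName]) => eventName);
}
```

Applied to the `incomingListeners` object above, all three events (`data`, `error`, `close`) would be flagged.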

### 4. No Outgoing Socket Pattern
```json
{
  "outgoingState": { "exists": false },
  "connectionClosed": false,
  "age": 5000
}
```
**Issue**: Connection setup failed but cleanup didn't trigger.
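
This pattern can be checked with a simple predicate. The field names follow the JSON above; `isMissingOutgoingSocket` and the 5-second grace period are assumptions for this sketch, chosen so that connections that are still being set up are not flagged.

```typescript
// Subset of a diagnostic record; field names follow the JSON excerpt above.
interface PendingConnection {
  outgoingState: { exists: boolean };
  connectionClosed: boolean;
  age: number; // milliseconds since the connection was accepted
}

// Hypothetical check: no outgoing socket, still tracked, and older than the grace period.
function isMissingOutgoingSocket(conn: PendingConnection, graceMs = 5000): boolean {
  return !conn.outgoingState.exists && !conn.connectionClosed && conn.age >= graceMs;
}
```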

## Forcing Diagnostic Capture

To capture current state immediately:
```typescript
monitor.forceCaptureNow();
```

This is useful when you notice accumulation starting.

## Automated Analysis

The monitor automatically analyzes patterns and logs:
- Zombie/half-zombie counts
- Stuck connection counts
- Old connection counts
- Memory usage
- Recommendations

## Integration Example

```typescript
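// In your proxy startup script
import { SmartProxy } from '@push.rocks/smartproxy';
// Adjust this path to wherever the monitor file lives relative to this script
// (the Quick Start above loads it from ./.nogit/debug/).
import ProductionConnectionMonitor from './production-connection-monitor.js';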

async function startProxyWithMonitoring() {
  const proxy = new SmartProxy({
    // your config
  });
  
  await proxy.start();
  
  // Start monitoring
  const monitor = new ProductionConnectionMonitor(proxy);
  monitor.start(5000);
  
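  // Optional: capture on demand, e.g. with `kill -USR2 <pid>`.
  // SIGUSR2 is used because Node.js reserves SIGUSR1 for starting the debugger.
  process.on('SIGUSR2', () => {
    console.log('Manual diagnostic capture triggered');
    monitor.forceCaptureNow();
  });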
  
  // Graceful shutdown
  process.on('SIGTERM', async () => {
    monitor.stop();
    await proxy.stop();
    process.exit(0);
  });
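}

// Start the proxy; exit with a non-zero code if startup fails.
startProxyWithMonitoring().catch((err) => {
  console.error('Failed to start proxy with monitoring:', err);
  process.exit(1);
});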
```

## Troubleshooting

### Monitor Not Detecting Accumulation
- Check the threshold settings (default: 50 connections)
- Reduce the check interval for faster detection
- Call `monitor.forceCaptureNow()` to capture the current state

### Too Many False Positives
- Increase accumulation threshold
- Increase spike threshold
- Adjust check interval

### Missing Diagnostic Data
- Ensure output directory exists and is writable
- Check disk space
- Verify process has write permissions

## Next Steps

1. Deploy the monitor to production
2. Wait for accumulation to occur
3. Share diagnostic files for analysis
4. Apply targeted fixes based on patterns found

The diagnostic data will reveal the exact state of connections when accumulation occurs, enabling precise fixes for your specific scenario.