fix(connection): filter zombie connections part 2

Juergen Kunz
2025-06-07 20:37:49 +00:00
parent 19590ef107
commit 890e907664
4 changed files with 498 additions and 1 deletions

readme.monitoring.md

@ -0,0 +1,202 @@
# Production Connection Monitoring
This document explains how to use the `ProductionConnectionMonitor` to diagnose connection accumulation issues in real time.
## Quick Start
```typescript
import ProductionConnectionMonitor from './.nogit/debug/production-connection-monitor.js';
// After starting your proxy
const monitor = new ProductionConnectionMonitor(proxy);
monitor.start(5000); // Check every 5 seconds
// The monitor automatically captures diagnostics when:
// - The connection count exceeds 50 (default threshold)
// - A sudden spike of 20+ connections occurs
// - You manually call monitor.forceCaptureNow()
```
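The two automatic triggers above can be pictured as a small decision function. This is only a sketch: the names `captureReason`, `TriggerOptions`, and the reason strings are hypothetical and not taken from the monitor's source; only the default values (50 and 20) come from this document.

```typescript
// Hypothetical option names; only the defaults (50 and 20) come from this README.
interface TriggerOptions {
  accumulationThreshold: number; // total connections above which a capture fires
  spikeThreshold: number;        // growth between two checks that counts as a spike
}

const DEFAULT_TRIGGERS: TriggerOptions = {
  accumulationThreshold: 50,
  spikeThreshold: 20,
};

type CaptureReason = 'threshold_exceeded' | 'sudden_spike' | null;

// Compare the connection count of the previous check with the current one
// and decide whether a diagnostic capture should be written.
function captureReason(
  previousCount: number,
  currentCount: number,
  options: TriggerOptions = DEFAULT_TRIGGERS,
): CaptureReason {
  if (currentCount > options.accumulationThreshold) {
    return 'threshold_exceeded';
  }
  if (currentCount - previousCount >= options.spikeThreshold) {
    return 'sudden_spike';
  }
  return null; // nothing unusual, no capture
}
```

With a 5 second interval, a jump from 10 to 30 connections between two checks would count as a spike under these assumptions, even though the total is still below 50.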
## What Gets Captured
When accumulation is detected, the monitor saves a JSON file with:
### Connection Details
- Socket states (destroyed, readable, writable, readyState)
- Connection age and activity timestamps
- Data transfer statistics (bytes sent/received)
- Target host and port information
- Keep-alive status
- Event listener counts
### System State
- Memory usage
- Event loop lag
- Connection count trends
- Termination statistics
## Reading Diagnostic Files
Files are saved to `.nogit/connection-diagnostics/` with names like:
```
accumulation_2025-06-07T20-20-43-733Z_force_capture.json
```
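When sorting or filtering many diagnostic files, it can help to recover the capture time and trigger from the file name. The sketch below assumes the naming scheme is exactly what the single example above suggests (`accumulation_<ISO timestamp with ":" and "." replaced by "-">_<reason>.json`); the function name `parseDiagnosticFileName` is hypothetical.

```typescript
interface DiagnosticFileInfo {
  capturedAt: Date; // capture time reconstructed from the file name
  reason: string;   // trailing part of the name, e.g. "force_capture"
}

// Assumed pattern, inferred from the one example file name in this README.
const FILE_NAME_PATTERN =
  /^accumulation_(\d{4}-\d{2}-\d{2})T(\d{2})-(\d{2})-(\d{2})-(\d{3})Z_(.+)\.json$/;

function parseDiagnosticFileName(fileName: string): DiagnosticFileInfo | null {
  const match = FILE_NAME_PATTERN.exec(fileName);
  if (!match) {
    return null; // not a diagnostic file, or a different naming scheme
  }
  const [, date, hours, minutes, seconds, millis, reason = ''] = match;
  // Restore the ":" and "." separators that were replaced in the file name.
  const capturedAt = new Date(`${date}T${hours}:${minutes}:${seconds}.${millis}Z`);
  return { capturedAt, reason };
}
```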
### Key Fields to Check
1. **Socket States**
```json
"incomingState": {
  "destroyed": false,
  "readable": true,
  "writable": true,
  "readyState": "open"
}
```
- Both sockets (`incomingState` and `outgoingState`) destroyed = zombie connection
- Only one socket destroyed = half-zombie
- Both sockets alive but the connection is old = potential stuck connection
2. **Data Transfer**
```json
"bytesReceived": 36,
"bytesSent": 0,
"timeSinceLastActivity": 60000
```
- Bytes received but none sent back = stuck connection
- High byte counts on an old connection = slow backend
- No recent activity (large `timeSinceLastActivity`) = idle connection
3. **Connection Flags**
```json
"hasReceivedInitialData": false,
"hasKeepAlive": true,
"connectionClosed": false
```
- `hasReceivedInitialData=false` on a non-TLS connection = immediate routing
- `hasKeepAlive=true` = extended timeout applies
- `connectionClosed=false` = still tracked
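The rules of thumb above can be combined into a rough per-connection classifier. This is a sketch under stated assumptions, not the monitor's own analysis: the `ConnectionRecord` shape covers only the fields shown in this document (plus `outgoingState` and `age`, assumed to sit alongside them in each record), and the 60 second idle cutoff is an illustrative value.

```typescript
interface SocketState {
  destroyed: boolean;
}

// Assumed subset of one connection entry in a diagnostic file.
interface ConnectionRecord {
  incomingState: SocketState;
  outgoingState: SocketState;
  bytesReceived: number;
  bytesSent: number;
  age: number;                   // milliseconds since the connection was opened
  timeSinceLastActivity: number; // milliseconds since data last flowed
}

type Verdict = 'zombie' | 'half-zombie' | 'stuck' | 'idle' | 'healthy';

const STUCK_AFTER_MS = 60 * 1000; // mirrors the 60s stuck-connection timeout
const IDLE_AFTER_MS = 60 * 1000;  // assumption, chosen only for illustration

function classifyConnection(record: ConnectionRecord): Verdict {
  const destroyedSockets =
    Number(record.incomingState.destroyed) + Number(record.outgoingState.destroyed);
  if (destroyedSockets === 2) return 'zombie';
  if (destroyedSockets === 1) return 'half-zombie';

  // Both sockets alive: look at data flow and age.
  const nothingSentBack = record.bytesReceived > 0 && record.bytesSent === 0;
  if (nothingSentBack && record.age >= STUCK_AFTER_MS) return 'stuck';
  if (record.timeSinceLastActivity >= IDLE_AFTER_MS) return 'idle';
  return 'healthy';
}
```

Running this over every entry in a diagnostic file gives a quick count per category before reading individual records in detail.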
## Common Patterns
### 1. Hanging Backend Pattern
```json
{
  "bytesReceived": 36,
  "bytesSent": 0,
  "age": 120000,
  "targetHost": "backend.example.com",
  "incomingState": { "destroyed": false },
  "outgoingState": { "destroyed": false }
}
```
**Fix**: The stuck connection detection (60s timeout) should clean these up.
### 2. Zombie Connection Pattern
```json
{
  "incomingState": { "destroyed": true },
  "outgoingState": { "destroyed": true },
  "connectionClosed": false
}
```
**Fix**: The zombie detection should clean these up within 30s.
### 3. Event Listener Leak Pattern
```json
{
  "incomingListeners": {
    "data": 15,
    "error": 20,
    "close": 18
  }
}
```
**Issue**: Event listeners are accumulating on the socket, which points to a potential memory leak.
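A simple way to spot this pattern in a diagnostic file is to flag any event whose listener count exceeds a limit. The sketch below uses 10 as the limit, which matches Node's default `EventEmitter` max-listeners warning level; the function name `findListenerLeaks` is hypothetical, and the input shape follows the `incomingListeners` object shown above.

```typescript
// Maps an event name ("data", "error", "close", ...) to its listener count.
type ListenerCounts = Record<string, number>;

// Node warns by default once more than 10 listeners are added for one event.
const DEFAULT_LISTENER_LIMIT = 10;

// Return the names of all events whose listener count is above the limit.
function findListenerLeaks(
  counts: ListenerCounts,
  limit: number = DEFAULT_LISTENER_LIMIT,
): string[] {
  return Object.entries(counts)
    .filter(([, count]) => count > limit)
    .map(([eventName]) => eventName);
}
```

For the example above, all three events (`data`, `error`, `close`) would be flagged. A healthy socket usually carries only a handful of listeners per event, so the limit may need tuning to your setup.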
### 4. No Outgoing Socket Pattern
```json
{
  "outgoingState": { "exists": false },
  "connectionClosed": false,
  "age": 5000
}
```
**Issue**: Connection setup failed but cleanup didn't trigger.
## Forcing Diagnostic Capture
To capture current state immediately:
```typescript
monitor.forceCaptureNow();
```
This is useful when you notice accumulation starting.
## Automated Analysis
The monitor automatically analyzes patterns and logs:
- Zombie/half-zombie counts
- Stuck connection counts
- Old connection counts
- Memory usage
- Recommendations
## Integration Example
```typescript
// In your proxy startup script
import { SmartProxy } from '@push.rocks/smartproxy';
import ProductionConnectionMonitor from './production-connection-monitor.js';

async function startProxyWithMonitoring() {
  const proxy = new SmartProxy({
    // your config
  });
  await proxy.start();

  // Start monitoring (check every 5 seconds)
  const monitor = new ProductionConnectionMonitor(proxy);
  monitor.start(5000);

  // Optional: capture on specific events, e.g. when the process receives SIGUSR1
  process.on('SIGUSR1', () => {
    console.log('Manual diagnostic capture triggered');
    monitor.forceCaptureNow();
  });

  // Graceful shutdown: stop the monitor before stopping the proxy
  process.on('SIGTERM', async () => {
    monitor.stop();
    await proxy.stop();
    process.exit(0);
  });
}

// Run the startup function and surface any startup failure
startProxyWithMonitoring().catch((error) => {
  console.error('Failed to start proxy with monitoring:', error);
  process.exit(1);
});
```
## Troubleshooting
### Monitor Not Detecting Accumulation
- Check threshold settings (default: 50 connections)
- Reduce check interval for faster detection
- Use `monitor.forceCaptureNow()` to capture the current state
### Too Many False Positives
- Increase accumulation threshold
- Increase spike threshold
- Adjust check interval
### Missing Diagnostic Data
- Ensure output directory exists and is writable
- Check disk space
- Verify process has write permissions
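The first and third checks can be done up front at startup. This is a minimal sketch using Node's `fs` module; the helper name `ensureWritableDirectory` is hypothetical and not part of the monitor, and it does not check free disk space.

```typescript
import { accessSync, constants, mkdirSync } from 'node:fs';

// Create the directory if it is missing, then verify that the current
// process is allowed to write to it. Returns false on any failure.
function ensureWritableDirectory(directory: string): boolean {
  try {
    mkdirSync(directory, { recursive: true }); // no error if it already exists
    accessSync(directory, constants.W_OK);     // throws if not writable
    return true;
  } catch {
    return false;
  }
}
```

Calling `ensureWritableDirectory('.nogit/connection-diagnostics')` before starting the monitor and logging a warning when it returns `false` would make a missing or read-only output directory visible early.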
## Next Steps
1. Deploy the monitor to production
2. Wait for accumulation to occur
3. Share diagnostic files for analysis
4. Apply targeted fixes based on patterns found
The diagnostic data will reveal the exact state of connections when accumulation occurs, enabling precise fixes for your specific scenario.