Refactor socket handling plan to address server crashes, memory leaks, and race conditions

Philipp Kunz 2025-06-01 13:01:24 +00:00
parent 37c87e8450
commit 9fdc2d5069


@@ -1,337 +1,148 @@
-# SmartProxy Socket Cleanup Fix Plan
+# SmartProxy Socket Handling Fix Plan
+Reread CLAUDE.md file for guidelines
 ## Problem Summary
-The current socket cleanup implementation is too aggressive and closes long-lived connections prematurely. This affects:
-- WebSocket connections in HTTPS passthrough
-- Long-lived HTTP connections (SSE, streaming)
-- Database connections
-- Any connection that should remain open for hours
+The SmartProxy server is experiencing critical issues:
+1. **Server crashes** due to unhandled socket connection errors (ECONNREFUSED)
+2. **Memory leak** with steadily rising active connection count
+3. **Race conditions** between socket creation and error handler attachment
+4. **Orphaned sockets** when server connections fail
 ## Root Causes
-### 1. **Bilateral Socket Cleanup**
-When one socket closes, both sockets are immediately destroyed:
-```typescript
-// In createSocketCleanupHandler
-cleanupSocket(clientSocket, 'client');
-cleanupSocket(serverSocket, 'server'); // Both destroyed together!
-```
+### 1. Delayed Error Handler Attachment
+- Sockets created without immediate error handlers
+- Error events can fire before handlers attached
+- Causes uncaught exceptions and server crashes
-### 2. **Aggressive Timeout Handling**
-Timeout events immediately trigger connection cleanup:
+### 2. Incomplete Cleanup Logic
+- Client sockets not cleaned up when server connection fails
+- Connection counter only decrements after BOTH sockets close
+- Failed server connections leave orphaned client sockets
+### 3. Missing Global Error Handlers
+- No process-level uncaughtException handler
+- No process-level unhandledRejection handler
+- Any unhandled error crashes entire server
+## Implementation Plan
+### Phase 1: Prevent Server Crashes (Critical)
+#### 1.1 Add Global Error Handlers
+- [ ] Add global error handlers in main entry point (ts/index.ts or smart-proxy.ts)
+- [ ] Log errors with context before graceful shutdown
+- [ ] Implement graceful shutdown sequence
+#### 1.2 Fix Socket Creation Race Condition
+- [ ] Modify socket creation to attach error handlers immediately
+- [ ] Update all forwarding handlers (https-passthrough, http, etc.)
+- [ ] Ensure error handlers attached in same tick as socket creation
+### Phase 2: Fix Memory Leaks (High Priority)
+#### 2.1 Fix Connection Cleanup Logic
+- [ ] Clean up client socket immediately if server connection fails
+- [ ] Decrement connection counter on any socket failure
+- [ ] Implement proper cleanup for half-open connections
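The first two items of 2.1 could be sketched as a once-only failure handler. This is a minimal illustration under stated assumptions, not SmartProxy's code: `SocketLike` and `createConnectFailureHandler` are hypothetical names, and the `onReleased` callback stands in for whatever decrements the active connection counter.

```typescript
// Hypothetical minimal socket shape; Node's net.Socket has both members.
interface SocketLike {
  destroyed: boolean;
  destroy(): void;
}

// Hypothetical helper: returns a handler to attach to the server socket's
// 'error' event. It destroys the client and reports the release exactly once.
function createConnectFailureHandler(
  client: SocketLike,
  onReleased: (reason: string) => void
): (err: Error) => void {
  let released = false;
  return (err: Error): void => {
    if (released) return; // ignore repeated events for the same connection
    released = true;
    if (!client.destroyed) {
      client.destroy(); // do not leave the client socket orphaned
    }
    onReleased(`server_connect_failed: ${err.message}`);
  };
}
```

The `released` guard is there because a failed connect is typically followed by a `close` event on the same socket; if both paths report the release, the counter would be decremented twice.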
+#### 2.2 Improve Socket Utils
+- [ ] Create new utility function for safe socket creation with immediate error handling
+- [ ] Update createIndependentSocketHandlers to handle immediate failures
+- [ ] Add connection tracking debug utilities
+### Phase 3: Comprehensive Testing (Important)
+#### 3.1 Create Test Cases
+- [ ] Test ECONNREFUSED scenario
+- [ ] Test timeout handling
+- [ ] Test half-open connections
+- [ ] Test rapid connect/disconnect cycles
+#### 3.2 Add Monitoring
+- [ ] Add connection leak detection
+- [ ] Add metrics for connection lifecycle
+- [ ] Add debug logging for socket state transitions
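One way the leak detection item in 3.2 might look is a small registry that records the last activity per connection and lists the ones that have been idle too long. Everything here is an assumption for illustration: `ConnectionLeakDetector`, `TrackedConnection`, and the `maxIdleMs` threshold are hypothetical and do not come from the plan.

```typescript
// Hypothetical record kept for each tracked connection.
interface TrackedConnection {
  id: string;
  openedAt: number;     // ms timestamp when the connection was registered
  lastActivity: number; // ms timestamp of the last observed data or event
}

// Hypothetical detector: flags connections idle for longer than maxIdleMs.
class ConnectionLeakDetector {
  private readonly connections = new Map<string, TrackedConnection>();
  private readonly maxIdleMs: number;

  constructor(maxIdleMs: number) {
    this.maxIdleMs = maxIdleMs;
  }

  open(id: string, now: number = Date.now()): void {
    this.connections.set(id, { id, openedAt: now, lastActivity: now });
  }

  touch(id: string, now: number = Date.now()): void {
    const conn = this.connections.get(id);
    if (conn) conn.lastActivity = now;
  }

  close(id: string): void {
    this.connections.delete(id);
  }

  get activeCount(): number {
    return this.connections.size;
  }

  // Returns candidates only; a long-lived idle connection can be legitimate.
  findSuspects(now: number = Date.now()): TrackedConnection[] {
    return Array.from(this.connections.values()).filter(
      (conn) => now - conn.lastActivity > this.maxIdleMs
    );
  }
}
```

Calling `findSuspects()` on a timer and logging the result would give the "steadily rising active connection count" symptom a concrete list of connection ids to inspect.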
+## Detailed Implementation Steps
+### Step 1: Global Error Handlers (ts/proxies/smart-proxy/smart-proxy.ts)
 ```typescript
-socket.on('timeout', () => {
-  handleClose(`${prefix}_timeout`); // Destroys both sockets!
+// Add in constructor or start method
+process.on('uncaughtException', (error) => {
+  logger.log('error', 'Uncaught exception', { error });
+  // Graceful shutdown
+});
+process.on('unhandledRejection', (reason, promise) => {
+  logger.log('error', 'Unhandled rejection', { reason, promise });
 });
 ```
-### 3. **Parity Check Forces Closure**
-If one socket closes but the other remains open for >2 minutes, connection is forcefully terminated:
+### Step 2: Safe Socket Creation Utility (ts/core/utils/socket-utils.ts)
 ```typescript
-if (record.outgoingClosedTime &&
-    !record.incoming.destroyed &&
-    now - record.outgoingClosedTime > 120000) {
-  this.cleanupConnection(record, 'parity_check');
+export function createSocketWithErrorHandler(
+  options: net.NetConnectOpts,
+  onError: (err: Error) => void
+): net.Socket {
+  const socket = net.connect(options);
+  socket.on('error', onError);
+  return socket;
 }
 ```
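A possible way to call the Step 2 utility from a forwarding handler is sketched here. It assumes Node's built-in `net` module; `connectToBackend` and its `onFailure` callback are hypothetical names used only for illustration, and the utility is repeated so the snippet stands alone.

```typescript
import * as net from 'net';

// Same shape as the Step 2 utility, repeated so this sketch is self-contained.
function createSocketWithErrorHandler(
  options: net.NetConnectOpts,
  onError: (err: Error) => void
): net.Socket {
  const socket = net.connect(options);
  socket.on('error', onError); // attached in the same tick as creation
  return socket;
}

// Hypothetical caller: opens the backend connection for an accepted client
// and tears the client down if the backend cannot be reached.
function connectToBackend(
  client: net.Socket,
  options: net.NetConnectOpts,
  onFailure: (err: Error) => void
): net.Socket {
  return createSocketWithErrorHandler(options, (err: Error) => {
    if (!client.destroyed) {
      client.destroy(); // e.g. ECONNREFUSED: do not leave the client orphaned
    }
    onFailure(err);
  });
}
```

Because the listener is registered before control returns to the event loop, an `ECONNREFUSED` from the backend reaches `onFailure` rather than surfacing as an uncaught `error` event.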
-### 4. **No Half-Open Connection Support**
-The proxy doesn't support TCP half-open connections where one side closes while the other continues sending.
+### Step 3: Fix HttpsPassthroughHandler (ts/forwarding/handlers/https-passthrough-handler.ts)
+- Replace direct socket creation with safe creation
+- Handle server connection failures immediately
+- Clean up client socket on server connection failure
-## Fix Implementation Plan
+### Step 4: Fix Connection Counting
+- Decrement on ANY socket close, not just when both close
+- Track failed connections separately
+- Add connection state tracking
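The counting rules in Step 4 might be sketched as a counter keyed by connection id. `ConnectionCounter` and `CloseKind` are hypothetical names for this illustration, and the plan does not say how SmartProxy identifies connections, so string ids are an assumption.

```typescript
// Hypothetical: how a connection left the active set.
type CloseKind = 'closed' | 'failed';

// Hypothetical counter: membership in a Set makes the decrement idempotent.
class ConnectionCounter {
  private readonly active = new Set<string>();
  private failed = 0;

  add(id: string): void {
    this.active.add(id);
  }

  // Call on the first 'close' or 'error' from either socket of the connection.
  // Returns false when the connection was already removed.
  remove(id: string, kind: CloseKind): boolean {
    if (!this.active.delete(id)) return false;
    if (kind === 'failed') this.failed++;
    return true;
  }

  get activeCount(): number {
    return this.active.size;
  }

  get failedCount(): number {
    return this.failed;
  }
}
```

Using a set rather than a bare integer means a second event for the same connection cannot push the count below its true value, which is the failure mode to watch for once the decrement happens on any socket close.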
-### Phase 1: Fix Socket Cleanup (Prevent Premature Closure)
+### Step 5: Update All Handlers
+- [ ] https-passthrough-handler.ts
+- [ ] http-handler.ts
+- [ ] https-terminate-to-http-handler.ts
+- [ ] https-terminate-to-https-handler.ts
+- [ ] route-connection-handler.ts
-#### 1.1 Modify `cleanupSocket()` to support graceful shutdown
-```typescript
-export interface CleanupOptions {
-  immediate?: boolean; // Force immediate destruction
-  allowDrain?: boolean; // Allow write buffer to drain
-  gracePeriod?: number; // Ms to wait before force close
-}
+## Success Criteria
-export function cleanupSocket(
-  socket: Socket | TLSSocket | null,
-  socketName?: string,
-  options: CleanupOptions = {}
-): Promise<void> {
-  if (!socket || socket.destroyed) return Promise.resolve();
-  return new Promise<void>((resolve) => {
-    const cleanup = () => {
-      socket.removeAllListeners();
-      if (!socket.destroyed) {
-        socket.destroy();
-      }
-      resolve();
-    };
-    if (options.immediate) {
-      cleanup();
-    } else if (options.allowDrain && socket.writable) {
-      // Allow pending writes to complete
-      socket.end(() => cleanup());
-      // Force cleanup after grace period
-      if (options.gracePeriod) {
-        setTimeout(cleanup, options.gracePeriod);
-      }
-    } else {
-      cleanup();
-    }
-  });
-}
-```
+1. **No server crashes** on ECONNREFUSED or other socket errors
+2. **Active connections** remain stable (no steady increase)
+3. **All sockets** properly cleaned up on errors
+4. **Memory usage** remains stable under load
+5. **Graceful handling** of all error scenarios
-#### 1.2 Implement Independent Socket Tracking
-```typescript
-export function createIndependentSocketHandlers(
-  clientSocket: Socket,
-  serverSocket: Socket,
-  onBothClosed: (reason: string) => void
-): { cleanupClient: () => void, cleanupServer: () => void } {
-  let clientClosed = false;
-  let serverClosed = false;
-  let clientReason = '';
-  let serverReason = '';
-  const checkBothClosed = () => {
-    if (clientClosed && serverClosed) {
-      onBothClosed(`client: ${clientReason}, server: ${serverReason}`);
-    }
-  };
-  const cleanupClient = async (reason: string) => {
-    if (clientClosed) return;
-    clientClosed = true;
-    clientReason = reason;
-    // Allow server to continue if still active
-    if (!serverClosed && serverSocket.writable) {
-      // Half-close: stop reading from client, let server finish
-      clientSocket.pause();
-      clientSocket.unpipe(serverSocket);
-      await cleanupSocket(clientSocket, 'client', { allowDrain: true });
-    } else {
-      await cleanupSocket(clientSocket, 'client');
-    }
-    checkBothClosed();
-  };
-  const cleanupServer = async (reason: string) => {
-    if (serverClosed) return;
-    serverClosed = true;
-    serverReason = reason;
-    // Allow client to continue if still active
-    if (!clientClosed && clientSocket.writable) {
-      // Half-close: stop reading from server, let client finish
-      serverSocket.pause();
-      serverSocket.unpipe(clientSocket);
-      await cleanupSocket(serverSocket, 'server', { allowDrain: true });
-    } else {
-      await cleanupSocket(serverSocket, 'server');
-    }
-    checkBothClosed();
-  };
-  return { cleanupClient, cleanupServer };
-}
-```
+## Testing Plan
-### Phase 2: Fix Timeout Handling
+1. Simulate ECONNREFUSED by targeting closed ports
+2. Monitor active connection count over time
+3. Stress test with rapid connections
+4. Test with unreachable hosts
+5. Test with slow/timing out connections
-#### 2.1 Separate timeout handling from connection closure
-```typescript
-export function setupSocketHandlers(
-  socket: Socket | TLSSocket,
-  handleClose: (reason: string) => void,
-  handleTimeout?: (socket: Socket) => void, // New optional handler
-  errorPrefix?: string
-): void {
-  socket.on('error', (error) => {
-    const prefix = errorPrefix || 'Socket';
-    handleClose(`${prefix}_error: ${error.message}`);
-  });
-  socket.on('close', () => {
-    const prefix = errorPrefix || 'socket';
-    handleClose(`${prefix}_closed`);
-  });
-  socket.on('timeout', () => {
-    if (handleTimeout) {
-      handleTimeout(socket); // Custom timeout handling
-    } else {
-      // Default: just log, don't close
-      console.warn(`Socket timeout: ${errorPrefix || 'socket'}`);
-    }
-  });
-}
-```
+## Rollback Plan
-#### 2.2 Update HTTPS passthrough handler
-```typescript
-// In https-passthrough-handler.ts
-const { cleanupClient, cleanupServer } = createIndependentSocketHandlers(
-  clientSocket,
-  serverSocket,
-  (reason) => {
-    this.emit(ForwardingHandlerEvents.DISCONNECTED, {
-      remoteAddress,
-      bytesSent,
-      bytesReceived,
-      reason
-    });
-  }
-);
+If issues arise:
+1. Revert socket creation changes
+2. Keep global error handlers (they add safety)
+3. Add more detailed logging for debugging
+4. Implement fixes incrementally
-// Setup handlers with custom timeout handling
-setupSocketHandlers(clientSocket, cleanupClient, (socket) => {
-  // Just reset timeout, don't close
-  socket.setTimeout(timeout);
-}, 'client');
+## Timeline
-setupSocketHandlers(serverSocket, cleanupServer, (socket) => {
-  // Just reset timeout, don't close
-  socket.setTimeout(timeout);
-}, 'server');
-```
+- Phase 1: Immediate (prevents crashes)
+- Phase 2: Within 24 hours (fixes leaks)
+- Phase 3: Within 48 hours (ensures stability)
-### Phase 3: Fix Connection Manager
-#### 3.1 Remove aggressive parity check
-```typescript
-// Remove or significantly increase the parity check timeout
-// From 2 minutes to 30 minutes for long-lived connections
-if (record.outgoingClosedTime &&
-    !record.incoming.destroyed &&
-    !record.connectionClosed &&
-    now - record.outgoingClosedTime > 1800000) { // 30 minutes
-  // Only close if no data activity
-  if (now - record.lastActivity > 600000) { // 10 minutes of inactivity
-    this.cleanupConnection(record, 'parity_check');
-  }
-}
-```
+## Notes
+- The race condition is the most critical issue
+- Connection counting logic needs complete overhaul
+- Consider using a connection state machine for clarity
+- Add connection lifecycle events for debugging
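The connection state machine suggested in the new Notes section could take roughly the following shape. This is a sketch under assumptions, not a design taken from the commit: the state names, event names, and the `ConnectionStateMachine` class are all hypothetical, and the optional `onTransition` callback is one way to surface lifecycle events for debugging.

```typescript
// Hypothetical states and events for one proxied connection.
type ConnState = 'connecting' | 'open' | 'client_closed' | 'server_closed' | 'closed';
type ConnEvent =
  | 'server_connected'
  | 'server_failed'
  | 'client_closed'
  | 'server_closed'
  | 'socket_error';

// Allowed transitions; any event not listed for a state is ignored.
const TRANSITIONS: Record<ConnState, Partial<Record<ConnEvent, ConnState>>> = {
  connecting: {
    server_connected: 'open',
    server_failed: 'closed',
    client_closed: 'closed',
    socket_error: 'closed',
  },
  open: { client_closed: 'client_closed', server_closed: 'server_closed', socket_error: 'closed' },
  client_closed: { server_closed: 'closed', socket_error: 'closed' },
  server_closed: { client_closed: 'closed', socket_error: 'closed' },
  closed: {},
};

type TransitionListener = (from: ConnState, event: ConnEvent, to: ConnState) => void;

class ConnectionStateMachine {
  private current: ConnState = 'connecting';
  private readonly onTransition?: TransitionListener;

  constructor(onTransition?: TransitionListener) {
    this.onTransition = onTransition;
  }

  get state(): ConnState {
    return this.current;
  }

  // Applies the event if a transition exists and returns the resulting state.
  dispatch(event: ConnEvent): ConnState {
    const next = TRANSITIONS[this.current][event];
    if (next === undefined) return this.current;
    const from = this.current;
    this.current = next;
    if (this.onTransition) this.onTransition(from, event, next);
    return next;
  }
}
```

Since `closed` has no outgoing transitions, cleanup and counter updates tied to the transition into `closed` would run at most once per connection, whichever socket reports first.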
-#### 3.2 Update cleanupConnection to check socket states
-```typescript
-public cleanupConnection(record: IConnectionRecord, reason: string = 'normal'): void {
-  if (!record.connectionClosed) {
-    record.connectionClosed = true;
-    // Only cleanup sockets that are actually closed or inactive
-    if (record.incoming && (!record.incoming.writable || record.incoming.destroyed)) {
-      cleanupSocket(record.incoming, `${record.id}-incoming`, { immediate: true });
-    }
-    if (record.outgoing && (!record.outgoing.writable || record.outgoing.destroyed)) {
-      cleanupSocket(record.outgoing, `${record.id}-outgoing`, { immediate: true });
-    }
-    // If either socket is still active, don't remove the record yet
-    if ((record.incoming && record.incoming.writable) ||
-        (record.outgoing && record.outgoing.writable)) {
-      record.connectionClosed = false; // Reset flag
-      return; // Don't finish cleanup
-    }
-    // Continue with full cleanup...
-  }
-}
-```
-### Phase 4: Testing and Validation
-#### 4.1 Test Cases to Implement
-1. WebSocket connection should stay open for >1 hour
-2. HTTP streaming response should continue after request closes
-3. Half-open connections should work correctly
-4. Verify no socket leaks with long-running connections
-5. Test graceful shutdown with pending data
-#### 4.2 Socket Leak Prevention
-- Ensure all event listeners are tracked and removed
-- Use WeakMap for socket metadata to prevent memory leaks
-- Implement connection count monitoring
-- Add periodic health checks for orphaned sockets
-## Implementation Order
-1. **Day 1**: Implement graceful `cleanupSocket()` and independent socket handlers
-2. **Day 2**: Update all handlers to use new cleanup mechanism
-3. **Day 3**: Fix timeout handling to not close connections
-4. **Day 4**: Update connection manager parity check and cleanup logic
-5. **Day 5**: Comprehensive testing and leak detection
-## Configuration Changes
-Add new options to SmartProxyOptions:
-```typescript
-interface ISmartProxyOptions {
-  // Existing options...
-  // New options for long-lived connections
-  socketCleanupGracePeriod?: number; // Default: 5000ms
-  allowHalfOpenConnections?: boolean; // Default: true
-  parityCheckTimeout?: number; // Default: 1800000ms (30 min)
-  timeoutBehavior?: 'close' | 'reset' | 'ignore'; // Default: 'reset'
-}
-```
-## Success Metrics
-1. WebSocket connections remain stable for 24+ hours
-2. No premature connection closures reported
-3. Memory usage remains stable (no socket leaks)
-4. Half-open connections work correctly
-5. Graceful shutdown completes within grace period
-## Implementation Status: COMPLETED ✅
-### Implemented Changes
-1. **Modified `cleanupSocket()` in `socket-utils.ts`**
-   - Added `CleanupOptions` interface with `immediate`, `allowDrain`, and `gracePeriod` options
-   - Implemented graceful shutdown support with write buffer draining
-2. **Created `createIndependentSocketHandlers()` in `socket-utils.ts`**
-   - Tracks socket states independently
-   - Supports half-open connections where one side can close while the other remains open
-   - Only triggers full cleanup when both sockets are closed
-3. **Updated `setupSocketHandlers()` in `socket-utils.ts`**
-   - Added optional `handleTimeout` parameter to customize timeout behavior
-   - Prevents automatic connection closure on timeout events
-4. **Updated HTTPS Passthrough Handler**
-   - Now uses `createIndependentSocketHandlers` for half-open support
-   - Custom timeout handling that resets timer instead of closing connection
-   - Manual data forwarding with backpressure handling
-5. **Updated Connection Manager**
-   - Extended parity check from 2 minutes to 30 minutes
-   - Added activity check before closing (10 minutes of inactivity required)
-   - Modified cleanup to check socket states before destroying
-6. **Updated Basic Forwarding in Route Connection Handler**
-   - Replaced simple `pipe()` with independent socket handlers
-   - Added manual data forwarding with backpressure support
-   - Removed bilateral close handlers to prevent premature cleanup
-### Test Results
-All tests passing:
-- ✅ Long-lived connection test: Connection stayed open for 61+ seconds with periodic keep-alive
-- ✅ Half-open connection test: One side closed while the other continued to send data
-- ✅ No socket leaks or premature closures
-### Notes
-- The fix maintains backward compatibility
-- No configuration changes required for existing deployments
-- Long-lived connections now work correctly in both HTTPS passthrough and basic forwarding modes