Refactor socket handling plan to address server crashes, memory leaks, and race conditions

This commit is contained in:
Philipp Kunz 2025-06-01 13:01:24 +00:00
parent 37c87e8450
commit 9fdc2d5069

View File

@ -1,337 +1,148 @@
# SmartProxy Socket Cleanup Fix Plan
# SmartProxy Socket Handling Fix Plan
Reread CLAUDE.md file for guidelines
## Problem Summary
The current socket cleanup implementation is too aggressive and closes long-lived connections prematurely. This affects:
- WebSocket connections in HTTPS passthrough
- Long-lived HTTP connections (SSE, streaming)
- Database connections
- Any connection that should remain open for hours
The SmartProxy server is experiencing critical issues:
1. **Server crashes** due to unhandled socket connection errors (ECONNREFUSED)
2. **Memory leak** with steadily rising active connection count
3. **Race conditions** between socket creation and error handler attachment
4. **Orphaned sockets** when server connections fail
## Root Causes
### 1. **Bilateral Socket Cleanup**
When one socket closes, both sockets are immediately destroyed:
```typescript
// In createSocketCleanupHandler
cleanupSocket(clientSocket, 'client');
cleanupSocket(serverSocket, 'server'); // Both destroyed together!
```
### 1. Delayed Error Handler Attachment
- Sockets created without immediate error handlers
- Error events can fire before handlers attached
- Causes uncaught exceptions and server crashes
### 2. **Aggressive Timeout Handling**
Timeout events immediately trigger connection cleanup:
### 2. Incomplete Cleanup Logic
- Client sockets not cleaned up when server connection fails
- Connection counter only decrements after BOTH sockets close
- Failed server connections leave orphaned client sockets
### 3. Missing Global Error Handlers
- No process-level uncaughtException handler
- No process-level unhandledRejection handler
- Any unhandled error crashes entire server
## Implementation Plan
### Phase 1: Prevent Server Crashes (Critical)
#### 1.1 Add Global Error Handlers
- [ ] Add global error handlers in main entry point (ts/index.ts or smart-proxy.ts)
- [ ] Log errors with context before graceful shutdown
- [ ] Implement graceful shutdown sequence
#### 1.2 Fix Socket Creation Race Condition
- [ ] Modify socket creation to attach error handlers immediately
- [ ] Update all forwarding handlers (https-passthrough, http, etc.)
- [ ] Ensure error handlers attached in same tick as socket creation
### Phase 2: Fix Memory Leaks (High Priority)
#### 2.1 Fix Connection Cleanup Logic
- [ ] Clean up client socket immediately if server connection fails
- [ ] Decrement connection counter on any socket failure
- [ ] Implement proper cleanup for half-open connections
#### 2.2 Improve Socket Utils
- [ ] Create new utility function for safe socket creation with immediate error handling
- [ ] Update createIndependentSocketHandlers to handle immediate failures
- [ ] Add connection tracking debug utilities
### Phase 3: Comprehensive Testing (Important)
#### 3.1 Create Test Cases
- [ ] Test ECONNREFUSED scenario
- [ ] Test timeout handling
- [ ] Test half-open connections
- [ ] Test rapid connect/disconnect cycles
#### 3.2 Add Monitoring
- [ ] Add connection leak detection
- [ ] Add metrics for connection lifecycle
- [ ] Add debug logging for socket state transitions
## Detailed Implementation Steps
### Step 1: Global Error Handlers (ts/proxies/smart-proxy/smart-proxy.ts)
```typescript
socket.on('timeout', () => {
handleClose(`${prefix}_timeout`); // Destroys both sockets!
// Add in constructor or start method
process.on('uncaughtException', (error) => {
logger.log('error', 'Uncaught exception', { error });
// Graceful shutdown
});
process.on('unhandledRejection', (reason, promise) => {
logger.log('error', 'Unhandled rejection', { reason, promise });
});
```
### 3. **Parity Check Forces Closure**
If one socket closes but the other remains open for >2 minutes, connection is forcefully terminated:
### Step 2: Safe Socket Creation Utility (ts/core/utils/socket-utils.ts)
```typescript
if (record.outgoingClosedTime &&
!record.incoming.destroyed &&
now - record.outgoingClosedTime > 120000) {
this.cleanupConnection(record, 'parity_check');
export function createSocketWithErrorHandler(
options: net.NetConnectOpts,
onError: (err: Error) => void
): net.Socket {
const socket = net.connect(options);
socket.on('error', onError);
return socket;
}
```
### 4. **No Half-Open Connection Support**
The proxy doesn't support TCP half-open connections where one side closes while the other continues sending.
### Step 3: Fix HttpsPassthroughHandler (ts/forwarding/handlers/https-passthrough-handler.ts)
- Replace direct socket creation with safe creation
- Handle server connection failures immediately
- Clean up client socket on server connection failure
## Fix Implementation Plan
### Step 4: Fix Connection Counting
- Decrement on ANY socket close, not just when both close
- Track failed connections separately
- Add connection state tracking
### Phase 1: Fix Socket Cleanup (Prevent Premature Closure)
### Step 5: Update All Handlers
- [ ] https-passthrough-handler.ts
- [ ] http-handler.ts
- [ ] https-terminate-to-http-handler.ts
- [ ] https-terminate-to-https-handler.ts
- [ ] route-connection-handler.ts
#### 1.1 Modify `cleanupSocket()` to support graceful shutdown
```typescript
export interface CleanupOptions {
immediate?: boolean; // Force immediate destruction
allowDrain?: boolean; // Allow write buffer to drain
gracePeriod?: number; // Ms to wait before force close
}
## Success Criteria
export function cleanupSocket(
socket: Socket | TLSSocket | null,
socketName?: string,
options: CleanupOptions = {}
): Promise<void> {
if (!socket || socket.destroyed) return Promise.resolve();
return new Promise<void>((resolve) => {
const cleanup = () => {
socket.removeAllListeners();
if (!socket.destroyed) {
socket.destroy();
}
resolve();
};
if (options.immediate) {
cleanup();
} else if (options.allowDrain && socket.writable) {
// Allow pending writes to complete
socket.end(() => cleanup());
// Force cleanup after grace period
if (options.gracePeriod) {
setTimeout(cleanup, options.gracePeriod);
}
} else {
cleanup();
}
});
}
```
1. **No server crashes** on ECONNREFUSED or other socket errors
2. **Active connections** remain stable (no steady increase)
3. **All sockets** properly cleaned up on errors
4. **Memory usage** remains stable under load
5. **Graceful handling** of all error scenarios
#### 1.2 Implement Independent Socket Tracking
```typescript
export function createIndependentSocketHandlers(
clientSocket: Socket,
serverSocket: Socket,
onBothClosed: (reason: string) => void
): { cleanupClient: () => void, cleanupServer: () => void } {
let clientClosed = false;
let serverClosed = false;
let clientReason = '';
let serverReason = '';
const checkBothClosed = () => {
if (clientClosed && serverClosed) {
onBothClosed(`client: ${clientReason}, server: ${serverReason}`);
}
};
const cleanupClient = async (reason: string) => {
if (clientClosed) return;
clientClosed = true;
clientReason = reason;
// Allow server to continue if still active
if (!serverClosed && serverSocket.writable) {
// Half-close: stop reading from client, let server finish
clientSocket.pause();
clientSocket.unpipe(serverSocket);
await cleanupSocket(clientSocket, 'client', { allowDrain: true });
} else {
await cleanupSocket(clientSocket, 'client');
}
checkBothClosed();
};
const cleanupServer = async (reason: string) => {
if (serverClosed) return;
serverClosed = true;
serverReason = reason;
// Allow client to continue if still active
if (!clientClosed && clientSocket.writable) {
// Half-close: stop reading from server, let client finish
serverSocket.pause();
serverSocket.unpipe(clientSocket);
await cleanupSocket(serverSocket, 'server', { allowDrain: true });
} else {
await cleanupSocket(serverSocket, 'server');
}
checkBothClosed();
};
return { cleanupClient, cleanupServer };
}
```
## Testing Plan
### Phase 2: Fix Timeout Handling
1. Simulate ECONNREFUSED by targeting closed ports
2. Monitor active connection count over time
3. Stress test with rapid connections
4. Test with unreachable hosts
5. Test with slow/timing out connections
#### 2.1 Separate timeout handling from connection closure
```typescript
export function setupSocketHandlers(
socket: Socket | TLSSocket,
handleClose: (reason: string) => void,
handleTimeout?: (socket: Socket) => void, // New optional handler
errorPrefix?: string
): void {
socket.on('error', (error) => {
const prefix = errorPrefix || 'Socket';
handleClose(`${prefix}_error: ${error.message}`);
});
socket.on('close', () => {
const prefix = errorPrefix || 'socket';
handleClose(`${prefix}_closed`);
});
socket.on('timeout', () => {
if (handleTimeout) {
handleTimeout(socket); // Custom timeout handling
} else {
// Default: just log, don't close
console.warn(`Socket timeout: ${errorPrefix || 'socket'}`);
}
});
}
```
## Rollback Plan
#### 2.2 Update HTTPS passthrough handler
```typescript
// In https-passthrough-handler.ts
const { cleanupClient, cleanupServer } = createIndependentSocketHandlers(
clientSocket,
serverSocket,
(reason) => {
this.emit(ForwardingHandlerEvents.DISCONNECTED, {
remoteAddress,
bytesSent,
bytesReceived,
reason
});
}
);
If issues arise:
1. Revert socket creation changes
2. Keep global error handlers (they add safety)
3. Add more detailed logging for debugging
4. Implement fixes incrementally
// Setup handlers with custom timeout handling
setupSocketHandlers(clientSocket, cleanupClient, (socket) => {
// Just reset timeout, don't close
socket.setTimeout(timeout);
}, 'client');
## Timeline
setupSocketHandlers(serverSocket, cleanupServer, (socket) => {
// Just reset timeout, don't close
socket.setTimeout(timeout);
}, 'server');
```
- Phase 1: Immediate (prevents crashes)
- Phase 2: Within 24 hours (fixes leaks)
- Phase 3: Within 48 hours (ensures stability)
### Phase 3: Fix Connection Manager
## Notes
#### 3.1 Remove aggressive parity check
```typescript
// Remove or significantly increase the parity check timeout
// From 2 minutes to 30 minutes for long-lived connections
if (record.outgoingClosedTime &&
!record.incoming.destroyed &&
!record.connectionClosed &&
now - record.outgoingClosedTime > 1800000) { // 30 minutes
// Only close if no data activity
if (now - record.lastActivity > 600000) { // 10 minutes of inactivity
this.cleanupConnection(record, 'parity_check');
}
}
```
#### 3.2 Update cleanupConnection to check socket states
```typescript
public cleanupConnection(record: IConnectionRecord, reason: string = 'normal'): void {
if (!record.connectionClosed) {
record.connectionClosed = true;
// Only cleanup sockets that are actually closed or inactive
if (record.incoming && (!record.incoming.writable || record.incoming.destroyed)) {
cleanupSocket(record.incoming, `${record.id}-incoming`, { immediate: true });
}
if (record.outgoing && (!record.outgoing.writable || record.outgoing.destroyed)) {
cleanupSocket(record.outgoing, `${record.id}-outgoing`, { immediate: true });
}
// If either socket is still active, don't remove the record yet
if ((record.incoming && record.incoming.writable) ||
(record.outgoing && record.outgoing.writable)) {
record.connectionClosed = false; // Reset flag
return; // Don't finish cleanup
}
// Continue with full cleanup...
}
}
```
### Phase 4: Testing and Validation
#### 4.1 Test Cases to Implement
1. WebSocket connection should stay open for >1 hour
2. HTTP streaming response should continue after request closes
3. Half-open connections should work correctly
4. Verify no socket leaks with long-running connections
5. Test graceful shutdown with pending data
#### 4.2 Socket Leak Prevention
- Ensure all event listeners are tracked and removed
- Use WeakMap for socket metadata to prevent memory leaks
- Implement connection count monitoring
- Add periodic health checks for orphaned sockets
## Implementation Order
1. **Day 1**: Implement graceful `cleanupSocket()` and independent socket handlers
2. **Day 2**: Update all handlers to use new cleanup mechanism
3. **Day 3**: Fix timeout handling to not close connections
4. **Day 4**: Update connection manager parity check and cleanup logic
5. **Day 5**: Comprehensive testing and leak detection
## Configuration Changes
Add new options to SmartProxyOptions:
```typescript
interface ISmartProxyOptions {
// Existing options...
// New options for long-lived connections
socketCleanupGracePeriod?: number; // Default: 5000ms
allowHalfOpenConnections?: boolean; // Default: true
parityCheckTimeout?: number; // Default: 1800000ms (30 min)
timeoutBehavior?: 'close' | 'reset' | 'ignore'; // Default: 'reset'
}
```
## Success Metrics
1. WebSocket connections remain stable for 24+ hours
2. No premature connection closures reported
3. Memory usage remains stable (no socket leaks)
4. Half-open connections work correctly
5. Graceful shutdown completes within grace period
## Implementation Status: COMPLETED ✅
### Implemented Changes
1. **Modified `cleanupSocket()` in `socket-utils.ts`**
- Added `CleanupOptions` interface with `immediate`, `allowDrain`, and `gracePeriod` options
- Implemented graceful shutdown support with write buffer draining
2. **Created `createIndependentSocketHandlers()` in `socket-utils.ts`**
- Tracks socket states independently
- Supports half-open connections where one side can close while the other remains open
- Only triggers full cleanup when both sockets are closed
3. **Updated `setupSocketHandlers()` in `socket-utils.ts`**
- Added optional `handleTimeout` parameter to customize timeout behavior
- Prevents automatic connection closure on timeout events
4. **Updated HTTPS Passthrough Handler**
- Now uses `createIndependentSocketHandlers` for half-open support
- Custom timeout handling that resets timer instead of closing connection
- Manual data forwarding with backpressure handling
5. **Updated Connection Manager**
- Extended parity check from 2 minutes to 30 minutes
- Added activity check before closing (10 minutes of inactivity required)
- Modified cleanup to check socket states before destroying
6. **Updated Basic Forwarding in Route Connection Handler**
- Replaced simple `pipe()` with independent socket handlers
- Added manual data forwarding with backpressure support
- Removed bilateral close handlers to prevent premature cleanup
### Test Results
All tests passing:
- ✅ Long-lived connection test: Connection stayed open for 61+ seconds with periodic keep-alive
- ✅ Half-open connection test: One side closed while the other continued to send data
- ✅ No socket leaks or premature closures
### Notes
- The fix maintains backward compatibility
- No configuration changes required for existing deployments
- Long-lived connections now work correctly in both HTTPS passthrough and basic forwarding modes
- The race condition is the most critical issue
- Connection counting logic needs complete overhaul
- Consider using a connection state machine for clarity
- Add connection lifecycle events for debugging