Refactor socket handling plan to address server crashes, memory leaks, and race conditions
This commit is contained in:
parent
37c87e8450
commit
9fdc2d5069
431
readme.plan.md
431
readme.plan.md
@ -1,337 +1,148 @@
|
||||
# SmartProxy Socket Cleanup Fix Plan
|
||||
# SmartProxy Socket Handling Fix Plan
|
||||
|
||||
Reread CLAUDE.md file for guidelines
|
||||
|
||||
## Problem Summary
|
||||
|
||||
The current socket cleanup implementation is too aggressive and closes long-lived connections prematurely. This affects:
|
||||
- WebSocket connections in HTTPS passthrough
|
||||
- Long-lived HTTP connections (SSE, streaming)
|
||||
- Database connections
|
||||
- Any connection that should remain open for hours
|
||||
The SmartProxy server is experiencing critical issues:
|
||||
1. **Server crashes** due to unhandled socket connection errors (ECONNREFUSED)
|
||||
2. **Memory leak** with steadily rising active connection count
|
||||
3. **Race conditions** between socket creation and error handler attachment
|
||||
4. **Orphaned sockets** when server connections fail
|
||||
|
||||
## Root Causes
|
||||
|
||||
### 1. **Bilateral Socket Cleanup**
|
||||
When one socket closes, both sockets are immediately destroyed:
|
||||
```typescript
|
||||
// In createSocketCleanupHandler
|
||||
cleanupSocket(clientSocket, 'client');
|
||||
cleanupSocket(serverSocket, 'server'); // Both destroyed together!
|
||||
```
|
||||
### 1. Delayed Error Handler Attachment
|
||||
- Sockets created without immediate error handlers
|
||||
- Error events can fire before handlers attached
|
||||
- Causes uncaught exceptions and server crashes
|
||||
|
||||
### 2. **Aggressive Timeout Handling**
|
||||
Timeout events immediately trigger connection cleanup:
|
||||
### 2. Incomplete Cleanup Logic
|
||||
- Client sockets not cleaned up when server connection fails
|
||||
- Connection counter only decrements after BOTH sockets close
|
||||
- Failed server connections leave orphaned client sockets
|
||||
|
||||
### 3. Missing Global Error Handlers
|
||||
- No process-level uncaughtException handler
|
||||
- No process-level unhandledRejection handler
|
||||
- Any unhandled error crashes entire server
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Prevent Server Crashes (Critical)
|
||||
|
||||
#### 1.1 Add Global Error Handlers
|
||||
- [ ] Add global error handlers in main entry point (ts/index.ts or smart-proxy.ts)
|
||||
- [ ] Log errors with context before graceful shutdown
|
||||
- [ ] Implement graceful shutdown sequence
|
||||
|
||||
#### 1.2 Fix Socket Creation Race Condition
|
||||
- [ ] Modify socket creation to attach error handlers immediately
|
||||
- [ ] Update all forwarding handlers (https-passthrough, http, etc.)
|
||||
- [ ] Ensure error handlers attached in same tick as socket creation
|
||||
|
||||
### Phase 2: Fix Memory Leaks (High Priority)
|
||||
|
||||
#### 2.1 Fix Connection Cleanup Logic
|
||||
- [ ] Clean up client socket immediately if server connection fails
|
||||
- [ ] Decrement connection counter on any socket failure
|
||||
- [ ] Implement proper cleanup for half-open connections
|
||||
|
||||
#### 2.2 Improve Socket Utils
|
||||
- [ ] Create new utility function for safe socket creation with immediate error handling
|
||||
- [ ] Update createIndependentSocketHandlers to handle immediate failures
|
||||
- [ ] Add connection tracking debug utilities
|
||||
|
||||
### Phase 3: Comprehensive Testing (Important)
|
||||
|
||||
#### 3.1 Create Test Cases
|
||||
- [ ] Test ECONNREFUSED scenario
|
||||
- [ ] Test timeout handling
|
||||
- [ ] Test half-open connections
|
||||
- [ ] Test rapid connect/disconnect cycles
|
||||
|
||||
#### 3.2 Add Monitoring
|
||||
- [ ] Add connection leak detection
|
||||
- [ ] Add metrics for connection lifecycle
|
||||
- [ ] Add debug logging for socket state transitions
|
||||
|
||||
## Detailed Implementation Steps
|
||||
|
||||
### Step 1: Global Error Handlers (ts/proxies/smart-proxy/smart-proxy.ts)
|
||||
```typescript
|
||||
socket.on('timeout', () => {
|
||||
handleClose(`${prefix}_timeout`); // Destroys both sockets!
|
||||
// Add in constructor or start method
|
||||
process.on('uncaughtException', (error) => {
|
||||
logger.log('error', 'Uncaught exception', { error });
|
||||
// Graceful shutdown
|
||||
});
|
||||
|
||||
process.on('unhandledRejection', (reason, promise) => {
|
||||
logger.log('error', 'Unhandled rejection', { reason, promise });
|
||||
});
|
||||
```
|
||||
|
||||
### 3. **Parity Check Forces Closure**
|
||||
If one socket closes but the other remains open for >2 minutes, connection is forcefully terminated:
|
||||
### Step 2: Safe Socket Creation Utility (ts/core/utils/socket-utils.ts)
|
||||
```typescript
|
||||
if (record.outgoingClosedTime &&
|
||||
!record.incoming.destroyed &&
|
||||
now - record.outgoingClosedTime > 120000) {
|
||||
this.cleanupConnection(record, 'parity_check');
|
||||
export function createSocketWithErrorHandler(
|
||||
options: net.NetConnectOpts,
|
||||
onError: (err: Error) => void
|
||||
): net.Socket {
|
||||
const socket = net.connect(options);
|
||||
socket.on('error', onError);
|
||||
return socket;
|
||||
}
|
||||
```
|
||||
|
||||
### 4. **No Half-Open Connection Support**
|
||||
The proxy doesn't support TCP half-open connections where one side closes while the other continues sending.
|
||||
### Step 3: Fix HttpsPassthroughHandler (ts/forwarding/handlers/https-passthrough-handler.ts)
|
||||
- Replace direct socket creation with safe creation
|
||||
- Handle server connection failures immediately
|
||||
- Clean up client socket on server connection failure
|
||||
|
||||
## Fix Implementation Plan
|
||||
### Step 4: Fix Connection Counting
|
||||
- Decrement on ANY socket close, not just when both close
|
||||
- Track failed connections separately
|
||||
- Add connection state tracking
|
||||
|
||||
### Phase 1: Fix Socket Cleanup (Prevent Premature Closure)
|
||||
### Step 5: Update All Handlers
|
||||
- [ ] https-passthrough-handler.ts
|
||||
- [ ] http-handler.ts
|
||||
- [ ] https-terminate-to-http-handler.ts
|
||||
- [ ] https-terminate-to-https-handler.ts
|
||||
- [ ] route-connection-handler.ts
|
||||
|
||||
#### 1.1 Modify `cleanupSocket()` to support graceful shutdown
|
||||
```typescript
|
||||
export interface CleanupOptions {
|
||||
immediate?: boolean; // Force immediate destruction
|
||||
allowDrain?: boolean; // Allow write buffer to drain
|
||||
gracePeriod?: number; // Ms to wait before force close
|
||||
}
|
||||
## Success Criteria
|
||||
|
||||
export function cleanupSocket(
|
||||
socket: Socket | TLSSocket | null,
|
||||
socketName?: string,
|
||||
options: CleanupOptions = {}
|
||||
): Promise<void> {
|
||||
if (!socket || socket.destroyed) return Promise.resolve();
|
||||
|
||||
return new Promise<void>((resolve) => {
|
||||
const cleanup = () => {
|
||||
socket.removeAllListeners();
|
||||
if (!socket.destroyed) {
|
||||
socket.destroy();
|
||||
}
|
||||
resolve();
|
||||
};
|
||||
|
||||
if (options.immediate) {
|
||||
cleanup();
|
||||
} else if (options.allowDrain && socket.writable) {
|
||||
// Allow pending writes to complete
|
||||
socket.end(() => cleanup());
|
||||
|
||||
// Force cleanup after grace period
|
||||
if (options.gracePeriod) {
|
||||
setTimeout(cleanup, options.gracePeriod);
|
||||
}
|
||||
} else {
|
||||
cleanup();
|
||||
}
|
||||
});
|
||||
}
|
||||
```
|
||||
1. **No server crashes** on ECONNREFUSED or other socket errors
|
||||
2. **Active connections** remain stable (no steady increase)
|
||||
3. **All sockets** properly cleaned up on errors
|
||||
4. **Memory usage** remains stable under load
|
||||
5. **Graceful handling** of all error scenarios
|
||||
|
||||
#### 1.2 Implement Independent Socket Tracking
|
||||
```typescript
|
||||
export function createIndependentSocketHandlers(
|
||||
clientSocket: Socket,
|
||||
serverSocket: Socket,
|
||||
onBothClosed: (reason: string) => void
|
||||
): { cleanupClient: () => void, cleanupServer: () => void } {
|
||||
let clientClosed = false;
|
||||
let serverClosed = false;
|
||||
let clientReason = '';
|
||||
let serverReason = '';
|
||||
|
||||
const checkBothClosed = () => {
|
||||
if (clientClosed && serverClosed) {
|
||||
onBothClosed(`client: ${clientReason}, server: ${serverReason}`);
|
||||
}
|
||||
};
|
||||
|
||||
const cleanupClient = async (reason: string) => {
|
||||
if (clientClosed) return;
|
||||
clientClosed = true;
|
||||
clientReason = reason;
|
||||
|
||||
// Allow server to continue if still active
|
||||
if (!serverClosed && serverSocket.writable) {
|
||||
// Half-close: stop reading from client, let server finish
|
||||
clientSocket.pause();
|
||||
clientSocket.unpipe(serverSocket);
|
||||
await cleanupSocket(clientSocket, 'client', { allowDrain: true });
|
||||
} else {
|
||||
await cleanupSocket(clientSocket, 'client');
|
||||
}
|
||||
|
||||
checkBothClosed();
|
||||
};
|
||||
|
||||
const cleanupServer = async (reason: string) => {
|
||||
if (serverClosed) return;
|
||||
serverClosed = true;
|
||||
serverReason = reason;
|
||||
|
||||
// Allow client to continue if still active
|
||||
if (!clientClosed && clientSocket.writable) {
|
||||
// Half-close: stop reading from server, let client finish
|
||||
serverSocket.pause();
|
||||
serverSocket.unpipe(clientSocket);
|
||||
await cleanupSocket(serverSocket, 'server', { allowDrain: true });
|
||||
} else {
|
||||
await cleanupSocket(serverSocket, 'server');
|
||||
}
|
||||
|
||||
checkBothClosed();
|
||||
};
|
||||
|
||||
return { cleanupClient, cleanupServer };
|
||||
}
|
||||
```
|
||||
## Testing Plan
|
||||
|
||||
### Phase 2: Fix Timeout Handling
|
||||
1. Simulate ECONNREFUSED by targeting closed ports
|
||||
2. Monitor active connection count over time
|
||||
3. Stress test with rapid connections
|
||||
4. Test with unreachable hosts
|
||||
5. Test with slow/timing out connections
|
||||
|
||||
#### 2.1 Separate timeout handling from connection closure
|
||||
```typescript
|
||||
export function setupSocketHandlers(
|
||||
socket: Socket | TLSSocket,
|
||||
handleClose: (reason: string) => void,
|
||||
handleTimeout?: (socket: Socket) => void, // New optional handler
|
||||
errorPrefix?: string
|
||||
): void {
|
||||
socket.on('error', (error) => {
|
||||
const prefix = errorPrefix || 'Socket';
|
||||
handleClose(`${prefix}_error: ${error.message}`);
|
||||
});
|
||||
|
||||
socket.on('close', () => {
|
||||
const prefix = errorPrefix || 'socket';
|
||||
handleClose(`${prefix}_closed`);
|
||||
});
|
||||
|
||||
socket.on('timeout', () => {
|
||||
if (handleTimeout) {
|
||||
handleTimeout(socket); // Custom timeout handling
|
||||
} else {
|
||||
// Default: just log, don't close
|
||||
console.warn(`Socket timeout: ${errorPrefix || 'socket'}`);
|
||||
}
|
||||
});
|
||||
}
|
||||
```
|
||||
## Rollback Plan
|
||||
|
||||
#### 2.2 Update HTTPS passthrough handler
|
||||
```typescript
|
||||
// In https-passthrough-handler.ts
|
||||
const { cleanupClient, cleanupServer } = createIndependentSocketHandlers(
|
||||
clientSocket,
|
||||
serverSocket,
|
||||
(reason) => {
|
||||
this.emit(ForwardingHandlerEvents.DISCONNECTED, {
|
||||
remoteAddress,
|
||||
bytesSent,
|
||||
bytesReceived,
|
||||
reason
|
||||
});
|
||||
}
|
||||
);
|
||||
If issues arise:
|
||||
1. Revert socket creation changes
|
||||
2. Keep global error handlers (they add safety)
|
||||
3. Add more detailed logging for debugging
|
||||
4. Implement fixes incrementally
|
||||
|
||||
// Setup handlers with custom timeout handling
|
||||
setupSocketHandlers(clientSocket, cleanupClient, (socket) => {
|
||||
// Just reset timeout, don't close
|
||||
socket.setTimeout(timeout);
|
||||
}, 'client');
|
||||
## Timeline
|
||||
|
||||
setupSocketHandlers(serverSocket, cleanupServer, (socket) => {
|
||||
// Just reset timeout, don't close
|
||||
socket.setTimeout(timeout);
|
||||
}, 'server');
|
||||
```
|
||||
- Phase 1: Immediate (prevents crashes)
|
||||
- Phase 2: Within 24 hours (fixes leaks)
|
||||
- Phase 3: Within 48 hours (ensures stability)
|
||||
|
||||
### Phase 3: Fix Connection Manager
|
||||
## Notes
|
||||
|
||||
#### 3.1 Remove aggressive parity check
|
||||
```typescript
|
||||
// Remove or significantly increase the parity check timeout
|
||||
// From 2 minutes to 30 minutes for long-lived connections
|
||||
if (record.outgoingClosedTime &&
|
||||
!record.incoming.destroyed &&
|
||||
!record.connectionClosed &&
|
||||
now - record.outgoingClosedTime > 1800000) { // 30 minutes
|
||||
// Only close if no data activity
|
||||
if (now - record.lastActivity > 600000) { // 10 minutes of inactivity
|
||||
this.cleanupConnection(record, 'parity_check');
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### 3.2 Update cleanupConnection to check socket states
|
||||
```typescript
|
||||
public cleanupConnection(record: IConnectionRecord, reason: string = 'normal'): void {
|
||||
if (!record.connectionClosed) {
|
||||
record.connectionClosed = true;
|
||||
|
||||
// Only cleanup sockets that are actually closed or inactive
|
||||
if (record.incoming && (!record.incoming.writable || record.incoming.destroyed)) {
|
||||
cleanupSocket(record.incoming, `${record.id}-incoming`, { immediate: true });
|
||||
}
|
||||
|
||||
if (record.outgoing && (!record.outgoing.writable || record.outgoing.destroyed)) {
|
||||
cleanupSocket(record.outgoing, `${record.id}-outgoing`, { immediate: true });
|
||||
}
|
||||
|
||||
// If either socket is still active, don't remove the record yet
|
||||
if ((record.incoming && record.incoming.writable) ||
|
||||
(record.outgoing && record.outgoing.writable)) {
|
||||
record.connectionClosed = false; // Reset flag
|
||||
return; // Don't finish cleanup
|
||||
}
|
||||
|
||||
// Continue with full cleanup...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Phase 4: Testing and Validation
|
||||
|
||||
#### 4.1 Test Cases to Implement
|
||||
1. WebSocket connection should stay open for >1 hour
|
||||
2. HTTP streaming response should continue after request closes
|
||||
3. Half-open connections should work correctly
|
||||
4. Verify no socket leaks with long-running connections
|
||||
5. Test graceful shutdown with pending data
|
||||
|
||||
#### 4.2 Socket Leak Prevention
|
||||
- Ensure all event listeners are tracked and removed
|
||||
- Use WeakMap for socket metadata to prevent memory leaks
|
||||
- Implement connection count monitoring
|
||||
- Add periodic health checks for orphaned sockets
|
||||
|
||||
## Implementation Order
|
||||
|
||||
1. **Day 1**: Implement graceful `cleanupSocket()` and independent socket handlers
|
||||
2. **Day 2**: Update all handlers to use new cleanup mechanism
|
||||
3. **Day 3**: Fix timeout handling to not close connections
|
||||
4. **Day 4**: Update connection manager parity check and cleanup logic
|
||||
5. **Day 5**: Comprehensive testing and leak detection
|
||||
|
||||
## Configuration Changes
|
||||
|
||||
Add new options to SmartProxyOptions:
|
||||
```typescript
|
||||
interface ISmartProxyOptions {
|
||||
// Existing options...
|
||||
|
||||
// New options for long-lived connections
|
||||
socketCleanupGracePeriod?: number; // Default: 5000ms
|
||||
allowHalfOpenConnections?: boolean; // Default: true
|
||||
parityCheckTimeout?: number; // Default: 1800000ms (30 min)
|
||||
timeoutBehavior?: 'close' | 'reset' | 'ignore'; // Default: 'reset'
|
||||
}
|
||||
```
|
||||
|
||||
## Success Metrics
|
||||
|
||||
1. WebSocket connections remain stable for 24+ hours
|
||||
2. No premature connection closures reported
|
||||
3. Memory usage remains stable (no socket leaks)
|
||||
4. Half-open connections work correctly
|
||||
5. Graceful shutdown completes within grace period
|
||||
|
||||
## Implementation Status: COMPLETED ✅
|
||||
|
||||
### Implemented Changes
|
||||
|
||||
1. **Modified `cleanupSocket()` in `socket-utils.ts`**
|
||||
- Added `CleanupOptions` interface with `immediate`, `allowDrain`, and `gracePeriod` options
|
||||
- Implemented graceful shutdown support with write buffer draining
|
||||
|
||||
2. **Created `createIndependentSocketHandlers()` in `socket-utils.ts`**
|
||||
- Tracks socket states independently
|
||||
- Supports half-open connections where one side can close while the other remains open
|
||||
- Only triggers full cleanup when both sockets are closed
|
||||
|
||||
3. **Updated `setupSocketHandlers()` in `socket-utils.ts`**
|
||||
- Added optional `handleTimeout` parameter to customize timeout behavior
|
||||
- Prevents automatic connection closure on timeout events
|
||||
|
||||
4. **Updated HTTPS Passthrough Handler**
|
||||
- Now uses `createIndependentSocketHandlers` for half-open support
|
||||
- Custom timeout handling that resets timer instead of closing connection
|
||||
- Manual data forwarding with backpressure handling
|
||||
|
||||
5. **Updated Connection Manager**
|
||||
- Extended parity check from 2 minutes to 30 minutes
|
||||
- Added activity check before closing (10 minutes of inactivity required)
|
||||
- Modified cleanup to check socket states before destroying
|
||||
|
||||
6. **Updated Basic Forwarding in Route Connection Handler**
|
||||
- Replaced simple `pipe()` with independent socket handlers
|
||||
- Added manual data forwarding with backpressure support
|
||||
- Removed bilateral close handlers to prevent premature cleanup
|
||||
|
||||
### Test Results
|
||||
|
||||
All tests passing:
|
||||
- ✅ Long-lived connection test: Connection stayed open for 61+ seconds with periodic keep-alive
|
||||
- ✅ Half-open connection test: One side closed while the other continued to send data
|
||||
- ✅ No socket leaks or premature closures
|
||||
|
||||
### Notes
|
||||
|
||||
- The fix maintains backward compatibility
|
||||
- No configuration changes required for existing deployments
|
||||
- Long-lived connections now work correctly in both HTTPS passthrough and basic forwarding modes
|
||||
- The race condition is the most critical issue
|
||||
- Connection counting logic needs complete overhaul
|
||||
- Consider using a connection state machine for clarity
|
||||
- Add connection lifecycle events for debugging
|
Loading…
x
Reference in New Issue
Block a user