Refactor socket handling plan to address server crashes, memory leaks, and race conditions
parent
37c87e8450
commit
9fdc2d5069
431
readme.plan.md
@@ -1,337 +1,148 @@
|
|||||||
# SmartProxy Socket Cleanup Fix Plan
|
# SmartProxy Socket Handling Fix Plan
|
||||||
|
|
||||||
|
Reread CLAUDE.md file for guidelines
|
||||||
|
|
||||||
## Problem Summary
|
## Problem Summary
|
||||||
|
|
||||||
The current socket cleanup implementation is too aggressive and closes long-lived connections prematurely. This affects:
|
The SmartProxy server is experiencing critical issues:
|
||||||
- WebSocket connections in HTTPS passthrough
|
1. **Server crashes** due to unhandled socket connection errors (ECONNREFUSED)
|
||||||
- Long-lived HTTP connections (SSE, streaming)
|
2. **Memory leak** with steadily rising active connection count
|
||||||
- Database connections
|
3. **Race conditions** between socket creation and error handler attachment
|
||||||
- Any connection that should remain open for hours
|
4. **Orphaned sockets** when server connections fail
|
||||||
|
|
||||||
## Root Causes
|
## Root Causes
|
||||||
|
|
||||||
### 1. **Bilateral Socket Cleanup**
|
### 1. Delayed Error Handler Attachment
|
||||||
When one socket closes, both sockets are immediately destroyed:
|
- Sockets created without immediate error handlers
|
||||||
```typescript
|
- Error events can fire before handlers are attached
|
||||||
// In createSocketCleanupHandler
|
- Causes uncaught exceptions and server crashes
|
||||||
cleanupSocket(clientSocket, 'client');
|
|
||||||
cleanupSocket(serverSocket, 'server'); // Both destroyed together!
|
|
||||||
```
|
|
||||||
|
|
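To illustrate why a late error handler is fatal: in Node.js, an `'error'` event emitted on an `EventEmitter` with no `'error'` listener is thrown instead of being delivered. The sketch below uses a plain `EventEmitter` as a stand-in for a socket (`net.Socket` is an `EventEmitter` and follows the same rule). The helper name `emitErrorSafely` is hypothetical and exists only for this demonstration.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical helper: emits 'error' and reports whether the emitter threw.
function emitErrorSafely(emitter: EventEmitter, err: Error): 'handled' | 'thrown' {
  try {
    emitter.emit('error', err);
    return 'handled';
  } catch {
    // With no 'error' listener, emit() throws the error itself.
    return 'thrown';
  }
}

const socketWithoutHandler = new EventEmitter();
const socketWithHandler = new EventEmitter();
socketWithHandler.on('error', () => {
  // A real handler would log the error and clean up the connection.
});

const refused = new Error('connect ECONNREFUSED 127.0.0.1:1');
const resultWithoutHandler = emitErrorSafely(socketWithoutHandler, refused);
const resultWithHandler = emitErrorSafely(socketWithHandler, refused);
```

On a real socket the event is emitted from the event loop rather than from inside a `try` block, so the throw surfaces as an uncaught exception.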
||||||
### 2. **Aggressive Timeout Handling**
|
### 2. Incomplete Cleanup Logic
|
||||||
Timeout events immediately trigger connection cleanup:
|
- Client sockets not cleaned up when server connection fails
|
||||||
|
- Connection counter only decrements after BOTH sockets close
|
||||||
|
- Failed server connections leave orphaned client sockets
|
||||||
|
|
||||||
|
### 3. Missing Global Error Handlers
|
||||||
|
- No process-level uncaughtException handler
|
||||||
|
- No process-level unhandledRejection handler
|
||||||
|
- Any unhandled error crashes the entire server
|
||||||
|
|
||||||
|
## Implementation Plan
|
||||||
|
|
||||||
|
### Phase 1: Prevent Server Crashes (Critical)
|
||||||
|
|
||||||
|
#### 1.1 Add Global Error Handlers
|
||||||
|
- [ ] Add global error handlers in main entry point (ts/index.ts or smart-proxy.ts)
|
||||||
|
- [ ] Log errors with context before graceful shutdown
|
||||||
|
- [ ] Implement graceful shutdown sequence
|
||||||
|
|
||||||
|
#### 1.2 Fix Socket Creation Race Condition
|
||||||
|
- [ ] Modify socket creation to attach error handlers immediately
|
||||||
|
- [ ] Update all forwarding handlers (https-passthrough, http, etc.)
|
||||||
|
- [ ] Ensure error handlers are attached in the same tick as socket creation
|
||||||
|
|
||||||
|
### Phase 2: Fix Memory Leaks (High Priority)
|
||||||
|
|
||||||
|
#### 2.1 Fix Connection Cleanup Logic
|
||||||
|
- [ ] Clean up client socket immediately if server connection fails
|
||||||
|
- [ ] Decrement connection counter on any socket failure
|
||||||
|
- [ ] Implement proper cleanup for half-open connections
|
||||||
|
|
||||||
|
#### 2.2 Improve Socket Utils
|
||||||
|
- [ ] Create new utility function for safe socket creation with immediate error handling
|
||||||
|
- [ ] Update createIndependentSocketHandlers to handle immediate failures
|
||||||
|
- [ ] Add connection tracking debug utilities
|
||||||
|
|
||||||
|
### Phase 3: Comprehensive Testing (Important)
|
||||||
|
|
||||||
|
#### 3.1 Create Test Cases
|
||||||
|
- [ ] Test ECONNREFUSED scenario
|
||||||
|
- [ ] Test timeout handling
|
||||||
|
- [ ] Test half-open connections
|
||||||
|
- [ ] Test rapid connect/disconnect cycles
|
||||||
|
|
||||||
|
#### 3.2 Add Monitoring
|
||||||
|
- [ ] Add connection leak detection
|
||||||
|
- [ ] Add metrics for connection lifecycle
|
||||||
|
- [ ] Add debug logging for socket state transitions
|
||||||
|
|
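A minimal sketch of what connection leak detection could look like, assuming each connection record carries a `lastActivity` timestamp. The `TrackedConnection` shape and the `findSuspectedLeaks` name are illustrative assumptions, not existing SmartProxy code. It flags connections by idle time rather than by age, so long-lived but active connections are not reported.

```typescript
// Hypothetical record shape; only the fields needed for the check.
interface TrackedConnection {
  id: string;
  openedAt: number; // epoch milliseconds
  lastActivity: number; // epoch milliseconds
}

// Returns the connections that have been idle for longer than maxIdleMs.
function findSuspectedLeaks(
  connections: Iterable<TrackedConnection>,
  nowMs: number,
  maxIdleMs: number,
): TrackedConnection[] {
  const suspects: TrackedConnection[] = [];
  for (const conn of connections) {
    if (nowMs - conn.lastActivity > maxIdleMs) {
      suspects.push(conn);
    }
  }
  return suspects;
}

// Example: one recently active connection, one idle for 700 seconds.
const sampleNow = 1000000;
const sampleConnections: TrackedConnection[] = [
  { id: 'a', openedAt: sampleNow - 5000, lastActivity: sampleNow - 1000 },
  { id: 'b', openedAt: sampleNow - 900000, lastActivity: sampleNow - 700000 },
];
const suspectedLeaks = findSuspectedLeaks(sampleConnections, sampleNow, 600000);
```

Running such a check on an interval and logging the result would give an early signal when the active connection count starts to drift.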
||||||
|
## Detailed Implementation Steps
|
||||||
|
|
||||||
|
### Step 1: Global Error Handlers (ts/proxies/smart-proxy/smart-proxy.ts)
|
||||||
```typescript
|
```typescript
|
||||||
socket.on('timeout', () => {
|
// Add in constructor or start method
|
||||||
handleClose(`${prefix}_timeout`); // Destroys both sockets!
|
process.on('uncaughtException', (error) => {
|
||||||
|
logger.log('error', 'Uncaught exception', { error });
|
||||||
|
// Graceful shutdown
|
||||||
|
});
|
||||||
|
|
||||||
|
process.on('unhandledRejection', (reason, promise) => {
|
||||||
|
logger.log('error', 'Unhandled rejection', { reason, promise });
|
||||||
});
|
});
|
||||||
```
|
```
|
||||||
|
|
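The `// Graceful shutdown` placeholder in Step 1 could be filled in along the following lines. This is a sketch under assumptions: `createShutdownOnce`, `stop`, and `exit` are hypothetical names, with `stop` standing in for whatever closes the proxy's listeners and `exit` for `process.exit`. It shows two properties worth having: the shutdown runs only once even if several fatal errors arrive, and it is bounded by a timeout.

```typescript
// Hypothetical helper; not part of SmartProxy.
function createShutdownOnce(
  stop: () => Promise<void>,
  exit: (code: number) => void,
  timeoutMs: number,
): (reason: string) => Promise<void> {
  let inFlight: Promise<void> | undefined;

  return (reason: string): Promise<void> => {
    if (inFlight) return inFlight; // a second fatal error must not restart shutdown

    console.error(`Fatal error (${reason}), shutting down`);

    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<void>((resolve) => {
      timer = setTimeout(resolve, timeoutMs);
    });
    // The async wrapper turns a synchronous throw from stop() into a rejection.
    const stopped = (async () => {
      await stop();
    })().catch(() => undefined);

    inFlight = Promise.race([stopped, deadline]).then(() => {
      if (timer !== undefined) clearTimeout(timer);
      exit(1); // non-zero: the process state is no longer trustworthy
    });
    return inFlight;
  };
}

// Demonstration with stubs instead of a real server and process.exit.
let stopCalls = 0;
const exitCodes: number[] = [];
const shutdownOnce = createShutdownOnce(
  async () => {
    stopCalls += 1;
  },
  (code) => {
    exitCodes.push(code);
  },
  1000,
);
const firstCall = shutdownOnce('uncaughtException');
const secondCall = shutdownOnce('unhandledRejection');
```

In the handlers above, each `process.on(...)` callback would log the error and then call `shutdownOnce(...)` with the event name, passing `process.exit` as `exit`.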
||||||
### 3. **Parity Check Forces Closure**
|
### Step 2: Safe Socket Creation Utility (ts/core/utils/socket-utils.ts)
|
||||||
If one socket closes but the other remains open for >2 minutes, connection is forcefully terminated:
|
|
||||||
```typescript
|
```typescript
|
||||||
if (record.outgoingClosedTime &&
|
export function createSocketWithErrorHandler(
|
||||||
!record.incoming.destroyed &&
|
options: net.NetConnectOpts,
|
||||||
now - record.outgoingClosedTime > 120000) {
|
onError: (err: Error) => void
|
||||||
this.cleanupConnection(record, 'parity_check');
|
): net.Socket {
|
||||||
|
const socket = net.connect(options);
|
||||||
|
socket.on('error', onError);
|
||||||
|
return socket;
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. **No Half-Open Connection Support**
|
### Step 3: Fix HttpsPassthroughHandler (ts/forwarding/handlers/https-passthrough-handler.ts)
|
||||||
The proxy doesn't support TCP half-open connections where one side closes while the other continues sending.
|
- Replace direct socket creation with safe creation
|
||||||
|
- Handle server connection failures immediately
|
||||||
|
- Clean up client socket on server connection failure
|
||||||
|
|
||||||
## Fix Implementation Plan
|
### Step 4: Fix Connection Counting
|
||||||
|
- Decrement on ANY socket close, not just when both close
|
||||||
|
- Track failed connections separately
|
||||||
|
- Add connection state tracking
|
||||||
|
|
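Decrementing on any socket close risks counting the same connection twice, once per socket. A minimal sketch of an idempotent counter follows, assuming each connection has a unique id. `ConnectionTracker` and its methods are hypothetical names, not the existing connection manager API.

```typescript
// Hypothetical tracker: only the first release() for an id changes the counts.
class ConnectionTracker {
  private readonly activeIds = new Set<string>();
  private failed = 0;

  open(id: string): void {
    this.activeIds.add(id);
  }

  // Safe to call from every 'close' and 'error' handler on either socket.
  release(id: string, outcome: 'closed' | 'failed' = 'closed'): boolean {
    if (!this.activeIds.delete(id)) return false; // already released
    if (outcome === 'failed') this.failed += 1;
    return true;
  }

  get activeCount(): number {
    return this.activeIds.size;
  }

  get failedCount(): number {
    return this.failed;
  }
}

const connectionTracker = new ConnectionTracker();
connectionTracker.open('conn-1');
connectionTracker.open('conn-2');
connectionTracker.release('conn-1', 'failed'); // server socket fails with ECONNREFUSED
connectionTracker.release('conn-1'); // client socket closes afterwards: no effect
```

After these calls only `conn-2` is still active and one failure has been recorded, regardless of how many handlers fired for `conn-1`.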
||||||
### Phase 1: Fix Socket Cleanup (Prevent Premature Closure)
|
### Step 5: Update All Handlers
|
||||||
|
- [ ] https-passthrough-handler.ts
|
||||||
|
- [ ] http-handler.ts
|
||||||
|
- [ ] https-terminate-to-http-handler.ts
|
||||||
|
- [ ] https-terminate-to-https-handler.ts
|
||||||
|
- [ ] route-connection-handler.ts
|
||||||
|
|
||||||
#### 1.1 Modify `cleanupSocket()` to support graceful shutdown
|
## Success Criteria
|
||||||
```typescript
|
|
||||||
export interface CleanupOptions {
|
|
||||||
immediate?: boolean; // Force immediate destruction
|
|
||||||
allowDrain?: boolean; // Allow write buffer to drain
|
|
||||||
gracePeriod?: number; // Ms to wait before force close
|
|
||||||
}
|
|
||||||
|
|
||||||
export function cleanupSocket(
|
1. **No server crashes** on ECONNREFUSED or other socket errors
|
||||||
socket: Socket | TLSSocket | null,
|
2. **Active connections** remain stable (no steady increase)
|
||||||
socketName?: string,
|
3. **All sockets** properly cleaned up on errors
|
||||||
options: CleanupOptions = {}
|
4. **Memory usage** remains stable under load
|
||||||
): Promise<void> {
|
5. **Graceful handling** of all error scenarios
|
||||||
if (!socket || socket.destroyed) return Promise.resolve();
|
|
||||||
|
|
||||||
return new Promise<void>((resolve) => {
|
|
||||||
const cleanup = () => {
|
|
||||||
socket.removeAllListeners();
|
|
||||||
if (!socket.destroyed) {
|
|
||||||
socket.destroy();
|
|
||||||
}
|
|
||||||
resolve();
|
|
||||||
};
|
|
||||||
|
|
||||||
if (options.immediate) {
|
|
||||||
cleanup();
|
|
||||||
} else if (options.allowDrain && socket.writable) {
|
|
||||||
// Allow pending writes to complete
|
|
||||||
socket.end(() => cleanup());
|
|
||||||
|
|
||||||
// Force cleanup after grace period
|
|
||||||
if (options.gracePeriod) {
|
|
||||||
setTimeout(cleanup, options.gracePeriod);
|
|
||||||
}
|
|
||||||
} else {
|
|
||||||
cleanup();
|
|
||||||
}
|
|
||||||
});
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 1.2 Implement Independent Socket Tracking
|
## Testing Plan
|
||||||
```typescript
|
|
||||||
export function createIndependentSocketHandlers(
|
|
||||||
clientSocket: Socket,
|
|
||||||
serverSocket: Socket,
|
|
||||||
onBothClosed: (reason: string) => void
|
|
||||||
): { cleanupClient: () => void, cleanupServer: () => void } {
|
|
||||||
let clientClosed = false;
|
|
||||||
let serverClosed = false;
|
|
||||||
let clientReason = '';
|
|
||||||
let serverReason = '';
|
|
||||||
|
|
||||||
const checkBothClosed = () => {
|
|
||||||
if (clientClosed && serverClosed) {
|
|
||||||
onBothClosed(`client: ${clientReason}, server: ${serverReason}`);
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
const cleanupClient = async (reason: string) => {
|
|
||||||
if (clientClosed) return;
|
|
||||||
clientClosed = true;
|
|
||||||
clientReason = reason;
|
|
||||||
|
|
||||||
// Allow server to continue if still active
|
|
||||||
if (!serverClosed && serverSocket.writable) {
|
|
||||||
// Half-close: stop reading from client, let server finish
|
|
||||||
clientSocket.pause();
|
|
||||||
clientSocket.unpipe(serverSocket);
|
|
||||||
await cleanupSocket(clientSocket, 'client', { allowDrain: true });
|
|
||||||
} else {
|
|
||||||
await cleanupSocket(clientSocket, 'client');
|
|
||||||
}
|
|
||||||
|
|
||||||
checkBothClosed();
|
|
||||||
};
|
|
||||||
|
|
||||||
const cleanupServer = async (reason: string) => {
|
|
||||||
if (serverClosed) return;
|
|
||||||
serverClosed = true;
|
|
||||||
serverReason = reason;
|
|
||||||
|
|
||||||
// Allow client to continue if still active
|
|
||||||
if (!clientClosed && clientSocket.writable) {
|
|
||||||
// Half-close: stop reading from server, let client finish
|
|
||||||
serverSocket.pause();
|
|
||||||
serverSocket.unpipe(clientSocket);
|
|
||||||
await cleanupSocket(serverSocket, 'server', { allowDrain: true });
|
|
||||||
} else {
|
|
||||||
await cleanupSocket(serverSocket, 'server');
|
|
||||||
}
|
|
||||||
|
|
||||||
checkBothClosed();
|
|
||||||
};
|
|
||||||
|
|
||||||
return { cleanupClient, cleanupServer };
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Phase 2: Fix Timeout Handling
|
1. Simulate ECONNREFUSED by targeting closed ports
|
||||||
|
2. Monitor active connection count over time
|
||||||
|
3. Stress test with rapid connections
|
||||||
|
4. Test with unreachable hosts
|
||||||
|
5. Test with slow or timed-out connections
|
||||||
|
|
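The first item, simulating ECONNREFUSED, can be sketched without hard-coding a port: bind a temporary server to an ephemeral port, close it, then connect to that port. The utility from Step 2 is repeated here so the block runs on its own. `getClosedPort` and `connectToClosedPort` are hypothetical helper names, and the approach assumes nothing else claims the port between closing and connecting.

```typescript
import * as net from 'node:net';

// Same shape as the Step 2 utility: the handler is attached in the creation tick.
function createSocketWithErrorHandler(
  options: net.NetConnectOpts,
  onError: (err: Error) => void,
): net.Socket {
  const socket = net.connect(options);
  socket.on('error', onError);
  return socket;
}

// Binds to an ephemeral port and closes it again, yielding a port that is
// very likely unused.
function getClosedPort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once('error', reject);
    probe.listen(0, '127.0.0.1', () => {
      const { port } = probe.address() as net.AddressInfo;
      probe.close(() => resolve(port));
    });
  });
}

// Resolves with the error code seen by the handler, or undefined if the
// connection unexpectedly succeeded.
async function connectToClosedPort(): Promise<string | undefined> {
  const port = await getClosedPort();
  return new Promise((resolve) => {
    const socket = createSocketWithErrorHandler({ host: '127.0.0.1', port }, (err) => {
      resolve((err as NodeJS.ErrnoException).code);
    });
    socket.once('connect', () => {
      socket.destroy();
      resolve(undefined);
    });
  });
}
```

A test would await `connectToClosedPort()` and assert that the process is still running and that the reported code is `ECONNREFUSED`.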
||||||
#### 2.1 Separate timeout handling from connection closure
|
## Rollback Plan
|
||||||
```typescript
|
|
||||||
export function setupSocketHandlers(
|
|
||||||
socket: Socket | TLSSocket,
|
|
||||||
handleClose: (reason: string) => void,
|
|
||||||
handleTimeout?: (socket: Socket) => void, // New optional handler
|
|
||||||
errorPrefix?: string
|
|
||||||
): void {
|
|
||||||
socket.on('error', (error) => {
|
|
||||||
const prefix = errorPrefix || 'Socket';
|
|
||||||
handleClose(`${prefix}_error: ${error.message}`);
|
|
||||||
});
|
|
||||||
|
|
||||||
socket.on('close', () => {
|
|
||||||
const prefix = errorPrefix || 'socket';
|
|
||||||
handleClose(`${prefix}_closed`);
|
|
||||||
});
|
|
||||||
|
|
||||||
socket.on('timeout', () => {
|
|
||||||
if (handleTimeout) {
|
|
||||||
handleTimeout(socket); // Custom timeout handling
|
|
||||||
} else {
|
|
||||||
// Default: just log, don't close
|
|
||||||
console.warn(`Socket timeout: ${errorPrefix || 'socket'}`);
|
|
||||||
}
|
|
||||||
});
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 2.2 Update HTTPS passthrough handler
|
If issues arise:
|
||||||
```typescript
|
1. Revert socket creation changes
|
||||||
// In https-passthrough-handler.ts
|
2. Keep global error handlers (they add safety)
|
||||||
const { cleanupClient, cleanupServer } = createIndependentSocketHandlers(
|
3. Add more detailed logging for debugging
|
||||||
clientSocket,
|
4. Implement fixes incrementally
|
||||||
serverSocket,
|
|
||||||
(reason) => {
|
|
||||||
this.emit(ForwardingHandlerEvents.DISCONNECTED, {
|
|
||||||
remoteAddress,
|
|
||||||
bytesSent,
|
|
||||||
bytesReceived,
|
|
||||||
reason
|
|
||||||
});
|
|
||||||
}
|
|
||||||
);
|
|
||||||
|
|
||||||
// Setup handlers with custom timeout handling
|
## Timeline
|
||||||
setupSocketHandlers(clientSocket, cleanupClient, (socket) => {
|
|
||||||
// Just reset timeout, don't close
|
|
||||||
socket.setTimeout(timeout);
|
|
||||||
}, 'client');
|
|
||||||
|
|
||||||
setupSocketHandlers(serverSocket, cleanupServer, (socket) => {
|
- Phase 1: Immediate (prevents crashes)
|
||||||
// Just reset timeout, don't close
|
- Phase 2: Within 24 hours (fixes leaks)
|
||||||
socket.setTimeout(timeout);
|
- Phase 3: Within 48 hours (ensures stability)
|
||||||
}, 'server');
|
|
||||||
```
|
|
||||||
|
|
||||||
### Phase 3: Fix Connection Manager
|
## Notes
|
||||||
|
|
||||||
#### 3.1 Remove aggressive parity check
|
- The race condition is the most critical issue
|
||||||
```typescript
|
- Connection counting logic needs complete overhaul
|
||||||
// Remove or significantly increase the parity check timeout
|
- Consider using a connection state machine for clarity
|
||||||
// From 2 minutes to 30 minutes for long-lived connections
|
- Add connection lifecycle events for debugging
|
||||||
if (record.outgoingClosedTime &&
|
|
||||||
!record.incoming.destroyed &&
|
|
||||||
!record.connectionClosed &&
|
|
||||||
now - record.outgoingClosedTime > 1800000) { // 30 minutes
|
|
||||||
// Only close if no data activity
|
|
||||||
if (now - record.lastActivity > 600000) { // 10 minutes of inactivity
|
|
||||||
this.cleanupConnection(record, 'parity_check');
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 3.2 Update cleanupConnection to check socket states
|
|
||||||
```typescript
|
|
||||||
public cleanupConnection(record: IConnectionRecord, reason: string = 'normal'): void {
|
|
||||||
if (!record.connectionClosed) {
|
|
||||||
record.connectionClosed = true;
|
|
||||||
|
|
||||||
// Only cleanup sockets that are actually closed or inactive
|
|
||||||
if (record.incoming && (!record.incoming.writable || record.incoming.destroyed)) {
|
|
||||||
cleanupSocket(record.incoming, `${record.id}-incoming`, { immediate: true });
|
|
||||||
}
|
|
||||||
|
|
||||||
if (record.outgoing && (!record.outgoing.writable || record.outgoing.destroyed)) {
|
|
||||||
cleanupSocket(record.outgoing, `${record.id}-outgoing`, { immediate: true });
|
|
||||||
}
|
|
||||||
|
|
||||||
// If either socket is still active, don't remove the record yet
|
|
||||||
if ((record.incoming && record.incoming.writable) ||
|
|
||||||
(record.outgoing && record.outgoing.writable)) {
|
|
||||||
record.connectionClosed = false; // Reset flag
|
|
||||||
return; // Don't finish cleanup
|
|
||||||
}
|
|
||||||
|
|
||||||
// Continue with full cleanup...
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### Phase 4: Testing and Validation
|
|
||||||
|
|
||||||
#### 4.1 Test Cases to Implement
|
|
||||||
1. WebSocket connection should stay open for >1 hour
|
|
||||||
2. HTTP streaming response should continue after request closes
|
|
||||||
3. Half-open connections should work correctly
|
|
||||||
4. Verify no socket leaks with long-running connections
|
|
||||||
5. Test graceful shutdown with pending data
|
|
||||||
|
|
||||||
#### 4.2 Socket Leak Prevention
|
|
||||||
- Ensure all event listeners are tracked and removed
|
|
||||||
- Use WeakMap for socket metadata to prevent memory leaks
|
|
||||||
- Implement connection count monitoring
|
|
||||||
- Add periodic health checks for orphaned sockets
|
|
||||||
|
|
||||||
## Implementation Order
|
|
||||||
|
|
||||||
1. **Day 1**: Implement graceful `cleanupSocket()` and independent socket handlers
|
|
||||||
2. **Day 2**: Update all handlers to use new cleanup mechanism
|
|
||||||
3. **Day 3**: Fix timeout handling to not close connections
|
|
||||||
4. **Day 4**: Update connection manager parity check and cleanup logic
|
|
||||||
5. **Day 5**: Comprehensive testing and leak detection
|
|
||||||
|
|
||||||
## Configuration Changes
|
|
||||||
|
|
||||||
Add new options to SmartProxyOptions:
|
|
||||||
```typescript
|
|
||||||
interface ISmartProxyOptions {
|
|
||||||
// Existing options...
|
|
||||||
|
|
||||||
// New options for long-lived connections
|
|
||||||
socketCleanupGracePeriod?: number; // Default: 5000ms
|
|
||||||
allowHalfOpenConnections?: boolean; // Default: true
|
|
||||||
parityCheckTimeout?: number; // Default: 1800000ms (30 min)
|
|
||||||
timeoutBehavior?: 'close' | 'reset' | 'ignore'; // Default: 'reset'
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Success Metrics
|
|
||||||
|
|
||||||
1. WebSocket connections remain stable for 24+ hours
|
|
||||||
2. No premature connection closures reported
|
|
||||||
3. Memory usage remains stable (no socket leaks)
|
|
||||||
4. Half-open connections work correctly
|
|
||||||
5. Graceful shutdown completes within grace period
|
|
||||||
|
|
||||||
## Implementation Status: COMPLETED ✅
|
|
||||||
|
|
||||||
### Implemented Changes
|
|
||||||
|
|
||||||
1. **Modified `cleanupSocket()` in `socket-utils.ts`**
|
|
||||||
- Added `CleanupOptions` interface with `immediate`, `allowDrain`, and `gracePeriod` options
|
|
||||||
- Implemented graceful shutdown support with write buffer draining
|
|
||||||
|
|
||||||
2. **Created `createIndependentSocketHandlers()` in `socket-utils.ts`**
|
|
||||||
- Tracks socket states independently
|
|
||||||
- Supports half-open connections where one side can close while the other remains open
|
|
||||||
- Only triggers full cleanup when both sockets are closed
|
|
||||||
|
|
||||||
3. **Updated `setupSocketHandlers()` in `socket-utils.ts`**
|
|
||||||
- Added optional `handleTimeout` parameter to customize timeout behavior
|
|
||||||
- Prevents automatic connection closure on timeout events
|
|
||||||
|
|
||||||
4. **Updated HTTPS Passthrough Handler**
|
|
||||||
- Now uses `createIndependentSocketHandlers` for half-open support
|
|
||||||
- Custom timeout handling that resets timer instead of closing connection
|
|
||||||
- Manual data forwarding with backpressure handling
|
|
||||||
|
|
||||||
5. **Updated Connection Manager**
|
|
||||||
- Extended parity check from 2 minutes to 30 minutes
|
|
||||||
- Added activity check before closing (10 minutes of inactivity required)
|
|
||||||
- Modified cleanup to check socket states before destroying
|
|
||||||
|
|
||||||
6. **Updated Basic Forwarding in Route Connection Handler**
|
|
||||||
- Replaced simple `pipe()` with independent socket handlers
|
|
||||||
- Added manual data forwarding with backpressure support
|
|
||||||
- Removed bilateral close handlers to prevent premature cleanup
|
|
||||||
|
|
||||||
### Test Results
|
|
||||||
|
|
||||||
All tests passing:
|
|
||||||
- ✅ Long-lived connection test: Connection stayed open for 61+ seconds with periodic keep-alive
|
|
||||||
- ✅ Half-open connection test: One side closed while the other continued to send data
|
|
||||||
- ✅ No socket leaks or premature closures
|
|
||||||
|
|
||||||
### Notes
|
|
||||||
|
|
||||||
- The fix maintains backward compatibility
|
|
||||||
- No configuration changes required for existing deployments
|
|
||||||
- Long-lived connections now work correctly in both HTTPS passthrough and basic forwarding modes
|
|