SmartProxy Socket Handling Fix Plan
Reread CLAUDE.md file for guidelines
Problem Summary
The SmartProxy server is experiencing critical issues:
- Server crashes due to unhandled socket connection errors (ECONNREFUSED)
- Memory leak with steadily rising active connection count
- Race conditions between socket creation and error handler attachment
- Orphaned sockets when server connections fail
Root Causes
1. Delayed Error Handler Attachment
- Sockets are created without immediate error handlers
- Error events can fire before handlers are attached
- An 'error' event with no listener is thrown as an uncaught exception and crashes the server
2. Incomplete Cleanup Logic
- Client sockets not cleaned up when server connection fails
- Connection counter only decrements after BOTH sockets close
- Failed server connections leave orphaned client sockets
3. Missing Global Error Handlers
- No process-level uncaughtException handler
- No process-level unhandledRejection handler
- Any unhandled error crashes entire server
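Root cause 1 follows from how Node.js treats 'error' events: when an emitter has no 'error' listener, emit() throws the error instead of delivering it. The following is a minimal sketch of that behaviour, using a plain EventEmitter as a stand-in for a socket; the variable names and the simulated error message are illustrative assumptions, not SmartProxy code.

```typescript
import { EventEmitter } from 'events';

// A plain EventEmitter stands in for a net.Socket (assumption for illustration).
const unguarded = new EventEmitter();
let threw = false;
try {
  // No 'error' listener is attached, so emit() throws the error itself.
  unguarded.emit('error', new Error('connect ECONNREFUSED (simulated)'));
} catch {
  threw = true;
}

const guarded = new EventEmitter();
let handledMessage = '';
// Listener attached before any event can fire: the error is delivered, not thrown.
guarded.on('error', (err: Error) => {
  handledMessage = err.message;
});
guarded.emit('error', new Error('connect ECONNREFUSED (simulated)'));

console.log({ threw, handledMessage });
```

On a real socket the throw happens inside the event loop rather than inside a try block, which is why it surfaces as an uncaught exception.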
Implementation Plan
Phase 1: Prevent Server Crashes (Critical)
1.1 Add Global Error Handlers
- Add global error handlers in main entry point (ts/index.ts or smart-proxy.ts)
- Log errors with context before graceful shutdown
- Implement graceful shutdown sequence
1.2 Fix Socket Creation Race Condition
- Modify socket creation to attach error handlers immediately
- Update all forwarding handlers (https-passthrough, http, etc.)
- Ensure error handlers are attached in the same tick as socket creation
Phase 2: Fix Memory Leaks (High Priority)
2.1 Fix Connection Cleanup Logic
- Clean up client socket immediately if server connection fails
- Decrement connection counter on any socket failure
- Implement proper cleanup for half-open connections
2.2 Improve Socket Utils
- Create new utility function for safe socket creation with immediate error handling
- Update createIndependentSocketHandlers to handle immediate failures
- Add connection tracking debug utilities
Phase 3: Comprehensive Testing (Important)
3.1 Create Test Cases
- Test ECONNREFUSED scenario
- Test timeout handling
- Test half-open connections
- Test rapid connect/disconnect cycles
3.2 Add Monitoring
- Add connection leak detection
- Add metrics for connection lifecycle
- Add debug logging for socket state transitions
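One possible shape for the leak detection mentioned above is a registry that records the last activity time per connection and reports entries that have been idle for too long. This is only a sketch: ConnectionLeakDetector, its method names, and the idle threshold are hypothetical and do not exist in SmartProxy.

```typescript
// Hypothetical helper: names and thresholds are illustrative, not SmartProxy API.
export interface TrackedConnection {
  id: string;
  openedAt: number;
  lastActivityAt: number;
}

export class ConnectionLeakDetector {
  private readonly connections = new Map<string, TrackedConnection>();
  private readonly maxIdleMs: number;

  constructor(maxIdleMs: number) {
    this.maxIdleMs = maxIdleMs;
  }

  // Register a connection when its client socket is accepted.
  track(id: string, now: number = Date.now()): void {
    this.connections.set(id, { id, openedAt: now, lastActivityAt: now });
  }

  // Record activity (for example on every 'data' event).
  touch(id: string, now: number = Date.now()): void {
    const entry = this.connections.get(id);
    if (entry) entry.lastActivityAt = now;
  }

  // Remove a connection once it has been cleaned up.
  untrack(id: string): boolean {
    return this.connections.delete(id);
  }

  // Connections idle for longer than maxIdleMs are leak suspects.
  findSuspects(now: number = Date.now()): TrackedConnection[] {
    return [...this.connections.values()].filter(
      (entry) => now - entry.lastActivityAt > this.maxIdleMs
    );
  }
}
```

Calling findSuspects() on a timer and logging the result would give an early signal when the active connection count starts to drift.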
Detailed Implementation Steps
Step 1: Global Error Handlers (ts/proxies/smart-proxy/smart-proxy.ts)
// Add in constructor or start method
process.on('uncaughtException', (error) => {
  logger.log('error', 'Uncaught exception', { error });
  // Graceful shutdown
});
process.on('unhandledRejection', (reason, promise) => {
  logger.log('error', 'Unhandled rejection', { reason, promise });
});
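The "Graceful shutdown" placeholder could be filled in along the following lines. This is a sketch under stated assumptions: createShutdownOnce and installGlobalErrorHandlers are hypothetical names, and the stop, log, and exit callbacks are injected so that nothing here presumes SmartProxy's real API. The shutdown runs at most once and forces an exit if stopping hangs.

```typescript
type LogFn = (level: 'error' | 'info', message: string, data?: unknown) => void;
type ShutdownFn = (reason: string, error: unknown) => Promise<void>;

// Hypothetical factory: stop, log and exit are injected (assumption).
export function createShutdownOnce(
  stop: () => Promise<void>,
  log: LogFn,
  exit: (code: number) => void,
  timeoutMs: number = 10000
): ShutdownFn {
  let shuttingDown = false;
  return async (reason, error) => {
    if (shuttingDown) return; // ignore repeated triggers
    shuttingDown = true;
    log('error', `Shutting down after ${reason}`, { error });
    // Force an exit if stop() never settles.
    const timer = setTimeout(() => exit(1), timeoutMs);
    try {
      await stop();
    } catch (stopError) {
      log('error', 'Error during shutdown', { error: stopError });
    } finally {
      clearTimeout(timer);
      exit(1);
    }
  };
}

// Possible wiring; not executed in this sketch.
export function installGlobalErrorHandlers(shutdown: ShutdownFn, log: LogFn): void {
  process.on('uncaughtException', (error) => void shutdown('uncaughtException', error));
  // Mirrors the plan above: unhandled rejections are logged only.
  process.on('unhandledRejection', (reason) => log('error', 'Unhandled rejection', { reason }));
}
```

In production the exit callback would typically be (code) => process.exit(code); passing it in keeps the shutdown sequence testable.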
Step 2: Safe Socket Creation Utility (ts/core/utils/socket-utils.ts)
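import * as net from 'net';

// Attach the error handler in the same tick as net.connect() so that an
// early 'error' event (for example ECONNREFUSED) always has a listener.
export function createSocketWithErrorHandler(
  options: net.NetConnectOpts,
  onError: (err: Error) => void
): net.Socket {
  const socket = net.connect(options);
  socket.on('error', onError);
  return socket;
}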
Step 3: Fix HttpsPassthroughHandler (ts/forwarding/handlers/https-passthrough-handler.ts)
- Replace direct socket creation with safe creation
- Handle server connection failures immediately
- Clean up client socket on server connection failure
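The three points above might be combined as in the sketch below. linkSockets and connectBackend are hypothetical names, not the actual HttpsPassthroughHandler code; the idea is that one idempotent cleanup function destroys both sockets as soon as either side errors or closes.

```typescript
import * as net from 'net';

// Hypothetical helper: ties the lifetime of a client socket to its backend socket.
export function linkSockets(
  client: net.Socket,
  backend: net.Socket,
  onCleanup: (reason: string) => void
): void {
  let cleanedUp = false;
  const cleanup = (reason: string): void => {
    if (cleanedUp) return; // run once even if several events fire
    cleanedUp = true;
    if (!client.destroyed) client.destroy();
    if (!backend.destroyed) backend.destroy();
    onCleanup(reason);
  };
  // 'error' listeners stay attached (on, not once) so a late error is never unhandled.
  backend.on('error', (err: Error) => cleanup(`backend_error: ${err.message}`));
  client.on('error', (err: Error) => cleanup(`client_error: ${err.message}`));
  backend.on('close', () => cleanup('backend_closed'));
  client.on('close', () => cleanup('client_closed'));
}

// Possible use inside a forwarding handler; not executed in this sketch.
export function connectBackend(
  client: net.Socket,
  options: net.NetConnectOpts,
  onCleanup: (reason: string) => void
): net.Socket {
  const backend = net.connect(options);
  linkSockets(client, backend, onCleanup); // same tick as net.connect()
  return backend;
}
```

This version tears down both sides on the first close, which is deliberately coarse; handlers that need half-open connections would have to distinguish 'end' from 'close' before destroying the peer.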
Step 4: Fix Connection Counting
- Decrement on ANY socket close, not just when both close
- Track failed connections separately
- Add connection state tracking
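A counter that decrements on any socket close needs to be idempotent, otherwise the client and server close events would decrement twice. One way to get that is to key the counter by connection id, as in this sketch; ConnectionTracker and its members are hypothetical names chosen for illustration.

```typescript
// Hypothetical tracker: a Set makes release() idempotent per connection id.
export class ConnectionTracker {
  private readonly active = new Set<string>();
  private failures = 0;

  // Call when the client socket is accepted.
  open(id: string): void {
    this.active.add(id);
  }

  // Call on ANY socket close or error. Returns true only for the first release.
  release(id: string, wasFailure: boolean = false): boolean {
    if (!this.active.delete(id)) return false;
    if (wasFailure) this.failures += 1;
    return true;
  }

  get activeCount(): number {
    return this.active.size;
  }

  get failedCount(): number {
    return this.failures;
  }
}
```

Because the second release for the same id is a no-op, both sockets of a connection can report their close without the count going negative.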
Step 5: Update All Handlers
- https-passthrough-handler.ts
- http-handler.ts
- https-terminate-to-http-handler.ts
- https-terminate-to-https-handler.ts
- route-connection-handler.ts
Success Criteria
- No server crashes on ECONNREFUSED or other socket errors
- Active connections remain stable (no steady increase)
- All sockets properly cleaned up on errors
- Memory usage remains stable under load
- Graceful handling of all error scenarios
Testing Plan
- Simulate ECONNREFUSED by targeting closed ports
- Monitor active connection count over time
- Stress test with rapid connections
- Test with unreachable hosts
- Test with slow or timing-out connections
Rollback Plan
If issues arise:
- Revert socket creation changes
- Keep global error handlers (they add safety)
- Add more detailed logging for debugging
- Implement fixes incrementally
Timeline
- Phase 1: Immediate (prevents crashes)
- Phase 2: Within 24 hours (fixes leaks)
- Phase 3: Within 48 hours (ensures stability)
Notes
- The race condition is the most critical issue
- Connection counting logic needs complete overhaul
- Consider using a connection state machine for clarity
- Add connection lifecycle events for debugging
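The state machine suggested in the notes could start as small as the sketch below. The state names, the transition table, and the ConnectionStateMachine class are assumptions made for illustration; the real set of states would depend on how SmartProxy models half-open connections.

```typescript
// Assumed states; the real lifecycle may need more (for example half-open).
export type ConnectionState = 'connecting' | 'open' | 'closing' | 'closed';

const allowedTransitions: Record<ConnectionState, ConnectionState[]> = {
  connecting: ['open', 'closing', 'closed'],
  open: ['closing', 'closed'],
  closing: ['closed'],
  closed: [],
};

export class ConnectionStateMachine {
  private current: ConnectionState = 'connecting';
  private readonly onTransition: (from: ConnectionState, to: ConnectionState) => void;

  // onTransition doubles as the lifecycle event hook for debugging.
  constructor(onTransition: (from: ConnectionState, to: ConnectionState) => void = () => {}) {
    this.onTransition = onTransition;
  }

  get state(): ConnectionState {
    return this.current;
  }

  // Returns false and leaves the state unchanged for an illegal transition.
  transition(next: ConnectionState): boolean {
    if (!allowedTransitions[this.current].includes(next)) return false;
    const previous = this.current;
    this.current = next;
    this.onTransition(previous, next);
    return true;
  }
}
```

Rejecting illegal transitions makes double cleanup visible: a second attempt to move a closed connection anywhere simply returns false and can be logged.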