smartproxy/readme.plan.md

SmartProxy Socket Handling Fix Plan

Reread the CLAUDE.md file for guidelines before starting.

Problem Summary

The SmartProxy server is experiencing critical issues:

  1. Server crashes due to unhandled socket connection errors (ECONNREFUSED)
  2. Memory leak with steadily rising active connection count
  3. Race conditions between socket creation and error handler attachment
  4. Orphaned sockets when server connections fail

Root Causes

1. Delayed Error Handler Attachment

  • Sockets are created without error handlers attached immediately
  • Error events can fire before the handlers are attached
  • An 'error' event with no listener is thrown as an uncaught exception, which crashes the server

2. Incomplete Cleanup Logic

  • Client sockets not cleaned up when server connection fails
  • Connection counter only decrements after BOTH sockets close
  • Failed server connections leave orphaned client sockets

3. Missing Global Error Handlers

  • No process-level uncaughtException handler
  • No process-level unhandledRejection handler
  • Any unhandled error crashes entire server

Implementation Plan

Phase 1: Prevent Server Crashes (Critical)

1.1 Add Global Error Handlers

  • Add global error handlers in main entry point (ts/index.ts or smart-proxy.ts)
  • Log errors with context before graceful shutdown
  • Implement graceful shutdown sequence

1.2 Fix Socket Creation Race Condition

  • Modify socket creation to attach error handlers immediately
  • Update all forwarding handlers (https-passthrough, http, etc.)
  • Ensure error handlers attached in same tick as socket creation

Phase 2: Fix Memory Leaks (High Priority)

2.1 Fix Connection Cleanup Logic

  • Clean up client socket immediately if server connection fails
  • Decrement connection counter on any socket failure
  • Implement proper cleanup for half-open connections

2.2 Improve Socket Utils

  • Create new utility function for safe socket creation with immediate error handling
  • Update createIndependentSocketHandlers to handle immediate failures
  • Add connection tracking debug utilities

Phase 3: Comprehensive Testing (Important)

3.1 Create Test Cases

  • Test ECONNREFUSED scenario
  • Test timeout handling
  • Test half-open connections
  • Test rapid connect/disconnect cycles
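
The ECONNREFUSED case can be reproduced locally without an external service. The following is a minimal sketch that assumes only Node's built-in net module; the helper names getClosedPort and connectAndCaptureError are hypothetical and are not part of SmartProxy's test suite.

```typescript
import * as net from 'net';

// Hypothetical helper: bind an ephemeral port, then close the listener so the
// port is (very likely) unused and connections to it are refused.
function getClosedPort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once('error', reject);
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address() as net.AddressInfo;
      server.close(() => resolve(port));
    });
  });
}

// Hypothetical helper: connect with the error handler attached in the same
// tick, and resolve with the error code (undefined if the connect succeeds).
function connectAndCaptureError(port: number): Promise<string | undefined> {
  return new Promise((resolve) => {
    const socket = net.connect({ host: '127.0.0.1', port });
    socket.once('error', (err: NodeJS.ErrnoException) => {
      socket.destroy();
      resolve(err.code);
    });
    socket.once('connect', () => {
      socket.destroy();
      resolve(undefined);
    });
  });
}
```

A test for the proxy itself would point a route at the closed port and assert that the process stays up and the active connection count returns to its previous value.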

3.2 Add Monitoring

  • Add connection leak detection
  • Add metrics for connection lifecycle
  • Add debug logging for socket state transitions
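
One possible shape for the leak detection check is sketched below. It assumes each tracked connection records when it last saw activity; the TrackedConnection interface and the findSuspectedLeaks function are hypothetical names, and SmartProxy's real connection records may look different.

```typescript
// Hypothetical record shape, used only for this sketch.
interface TrackedConnection {
  id: string;
  openedAt: number;       // epoch milliseconds
  lastActivityAt: number; // epoch milliseconds
}

// Return every connection that has been idle for longer than maxIdleMs.
// The current time is passed in so the check is easy to test.
function findSuspectedLeaks(
  connections: readonly TrackedConnection[],
  now: number,
  maxIdleMs: number,
): TrackedConnection[] {
  return connections.filter((conn) => now - conn.lastActivityAt > maxIdleMs);
}
```

Running such a check on a timer and logging the returned ids would make a steadily growing set of idle connections visible before memory usage becomes a problem.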

Detailed Implementation Steps

Step 1: Global Error Handlers (ts/proxies/smart-proxy/smart-proxy.ts)

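// Add in the start method (or another once-per-process location) so that
// constructing several proxies does not stack duplicate listeners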
process.on('uncaughtException', (error) => {
  logger.log('error', 'Uncaught exception', { error });
  // Graceful shutdown
});

process.on('unhandledRejection', (reason, promise) => {
  logger.log('error', 'Unhandled rejection', { reason, promise });
});
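
The "Graceful shutdown" placeholder above could be filled in along the following lines. This is a sketch under assumptions, not the project's implementation: installGlobalErrorHandlers is a hypothetical helper, the shutdown callback stands in for whatever stops the proxy, and the 10 second force-exit timeout is an arbitrary choice.

```typescript
type LogFn = (level: string, message: string, meta?: unknown) => void;

// Hypothetical helper: registers process-level handlers and returns a
// function that removes them again.
function installGlobalErrorHandlers(
  shutdown: () => Promise<void>,
  log: LogFn = (level, message, meta) => console.error(level, message, meta),
  exit: (code: number) => void = (code) => process.exit(code),
): () => void {
  let shuttingDown = false;

  const onUncaught = (error: Error): void => {
    log('error', 'Uncaught exception', { error });
    if (shuttingDown) return; // a shutdown is already in progress
    shuttingDown = true;

    // Force the exit if the graceful path hangs (timeout value is assumed).
    const timer = setTimeout(() => exit(1), 10000);
    timer.unref();

    shutdown().then(
      () => exit(1),
      (err) => {
        log('error', 'Shutdown failed', { err });
        exit(1);
      },
    );
  };

  const onRejection = (reason: unknown): void => {
    log('error', 'Unhandled rejection', { reason });
  };

  process.on('uncaughtException', onUncaught);
  process.on('unhandledRejection', onRejection);

  return () => {
    process.removeListener('uncaughtException', onUncaught);
    process.removeListener('unhandledRejection', onRejection);
  };
}
```

Exiting with a non-zero code after an uncaught exception keeps the behavior visible to a process supervisor, while the shutdown callback gives open connections a chance to close first.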

Step 2: Safe Socket Creation Utility (ts/core/utils/socket-utils.ts)

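import * as net from 'net';

// Create the socket and attach the error handler in the same tick, so a
// connection error (e.g. ECONNREFUSED) can never fire without a listener.
export function createSocketWithErrorHandler(
  options: net.NetConnectOpts,
  onError: (err: Error) => void
): net.Socket {
  const socket = net.connect(options);
  socket.on('error', onError);
  return socket;
}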

Step 3: Fix HttpsPassthroughHandler (ts/forwarding/handlers/https-passthrough-handler.ts)

  • Replace direct socket creation with safe creation
  • Handle server connection failures immediately
  • Clean up client socket on server connection failure
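
The three points above can be sketched together as follows. The function forwardToBackend and its onCleanup callback are hypothetical names chosen for illustration; the real HttpsPassthroughHandler is structured differently, so treat this as an outline of the cleanup pattern only.

```typescript
import * as net from 'net';

// Hypothetical sketch: open the backend connection with handlers attached in
// the same tick, and tear down BOTH sockets exactly once on any failure.
function forwardToBackend(
  clientSocket: net.Socket,
  target: { host: string; port: number },
  onCleanup: (reason: string) => void,
): net.Socket {
  let cleanedUp = false;

  const cleanup = (reason: string): void => {
    if (cleanedUp) return; // guard against double cleanup
    cleanedUp = true;
    clientSocket.destroy();
    serverSocket.destroy();
    onCleanup(reason);
  };

  const serverSocket = net.connect(target);
  serverSocket.on('error', (err) => cleanup(`server_error: ${err.message}`));
  serverSocket.on('close', () => cleanup('server_closed'));
  clientSocket.on('error', (err) => cleanup(`client_error: ${err.message}`));
  clientSocket.on('close', () => cleanup('client_closed'));

  // Start piping only once the backend connection is established.
  serverSocket.once('connect', () => {
    clientSocket.pipe(serverSocket);
    serverSocket.pipe(clientSocket);
  });

  return serverSocket;
}
```

Because cleanup runs on the first error or close from either side, a refused backend connection destroys the client socket immediately instead of leaving it orphaned.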

Step 4: Fix Connection Counting

  • Decrement when EITHER socket closes or fails, not only after both close; guard the decrement so each connection is counted down exactly once
  • Track failed connections separately
  • Add connection state tracking
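
A small tracker illustrates how the counter can be released from any socket event without being decremented twice. ConnectionTracker and its method names are hypothetical; this is a sketch of the idea, not the existing connection manager.

```typescript
// Hypothetical tracker: release() may be called from every 'close' and
// 'error' handler, but only the first call for a given id has an effect.
class ConnectionTracker {
  private readonly active = new Set<string>();
  private failures = 0;

  open(connectionId: string): void {
    this.active.add(connectionId);
  }

  // Returns true if this call actually released the connection.
  release(connectionId: string, failed = false): boolean {
    if (!this.active.delete(connectionId)) return false; // already released
    if (failed) this.failures += 1;
    return true;
  }

  get activeCount(): number {
    return this.active.size;
  }

  get failedCount(): number {
    return this.failures;
  }
}
```

Keying the count on a set of connection ids rather than a bare integer makes a double decrement impossible, and the separate failure counter covers the "track failed connections separately" point.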

Step 5: Update All Handlers

  • https-passthrough-handler.ts
  • http-handler.ts
  • https-terminate-to-http-handler.ts
  • https-terminate-to-https-handler.ts
  • route-connection-handler.ts

Success Criteria

  1. No server crashes on ECONNREFUSED or other socket errors
  2. Active connections remain stable (no steady increase)
  3. All sockets properly cleaned up on errors
  4. Memory usage remains stable under load
  5. Graceful handling of all error scenarios

Testing Plan

  1. Simulate ECONNREFUSED by targeting closed ports
  2. Monitor active connection count over time
  3. Stress test with rapid connections
  4. Test with unreachable hosts
  5. Test with slow connections and connections that time out

Rollback Plan

If issues arise:

  1. Revert socket creation changes
  2. Keep global error handlers (they add safety)
  3. Add more detailed logging for debugging
  4. Implement fixes incrementally

Timeline

  • Phase 1: Immediate (prevents crashes)
  • Phase 2: Within 24 hours (fixes leaks)
  • Phase 3: Within 48 hours (ensures stability)

Notes

  • The race condition is the most critical issue
  • Connection counting logic needs complete overhaul
  • Consider using a connection state machine for clarity
  • Add connection lifecycle events for debugging
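
The last two notes could be combined into one small class, sketched below. The state names, the ALLOWED_TRANSITIONS table, and the ConnectionLifecycle class are all assumptions made for illustration; the actual set of states would need to match SmartProxy's connection records.

```typescript
import { EventEmitter } from 'events';

// Hypothetical states for a proxied connection.
type ConnectionState = 'connecting' | 'established' | 'closing' | 'closed';

const ALLOWED_TRANSITIONS: Record<ConnectionState, readonly ConnectionState[]> = {
  connecting: ['established', 'closing', 'closed'],
  established: ['closing', 'closed'],
  closing: ['closed'],
  closed: [],
};

// Emits a 'transition' event (previous, next) for every accepted change,
// which can feed debug logging for socket state transitions.
class ConnectionLifecycle extends EventEmitter {
  private current: ConnectionState = 'connecting';

  get state(): ConnectionState {
    return this.current;
  }

  // Returns false and leaves the state untouched for an invalid transition.
  transition(next: ConnectionState): boolean {
    if (ALLOWED_TRANSITIONS[this.current].indexOf(next) === -1) return false;
    const previous = this.current;
    this.current = next;
    this.emit('transition', previous, next);
    return true;
  }
}
```

Since 'closed' has no outgoing transitions, cleanup code that runs on the transition into 'closed' can only run once per connection.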