# SmartProxy Development Hints

## Byte Tracking and Metrics

### Throughput Drift Issue (Fixed)

**Problem**: Throughput numbers were gradually increasing over time for long-lived connections.

**Root Cause**: The `byRoute()` and `byIP()` methods were dividing cumulative total bytes (since connection start) by the window duration, causing rates to appear higher as connections aged:

- Hour 1: 1GB total / 60s = 17 MB/s ✓
- Hour 2: 2GB total / 60s = 34 MB/s ✗ (appears doubled!)
- Hour 3: 3GB total / 60s = 50 MB/s ✗ (keeps rising!)

**Solution**: Implemented dedicated ThroughputTracker instances for each route and IP address:

- Each route and IP gets its own throughput tracker with per-second sampling
- Samples are taken every second and stored in a circular buffer
- Rate calculations use actual samples within the requested window
- Default window is now 1 second for real-time accuracy

### What Gets Counted (Network Interface Throughput)

The byte tracking is designed to match network interface throughput (what Unifi/network monitoring tools show):

**Counted bytes include:**

- All application data
- TLS handshakes and protocol overhead
- TLS record headers and encryption padding
- HTTP headers and protocol data
- WebSocket frames and protocol overhead
- TLS alerts sent to clients

**NOT counted:**

- PROXY protocol headers (sent to the backend, not the client)
- TCP/IP headers (handled by the OS, not visible at the application layer)

**Byte direction:**

- `bytesReceived`: All bytes received FROM the client on the incoming connection
- `bytesSent`: All bytes sent TO the client on the incoming connection
- Backend connections are separate and not mixed with client metrics

### Double Counting Issue (Fixed)

**Problem**: Initial data chunks were being counted twice in the byte tracking:

1. Once when stored in `pendingData` in `setupDirectConnection()`
2. Again when the data flowed through bidirectional forwarding

**Solution**: Removed the byte counting when storing initial chunks. Bytes are now only counted when they actually flow through the `setupBidirectionalForwarding()` callbacks.

### HttpProxy Metrics (Fixed)

**Problem**: HttpProxy forwarding was updating connection record byte counts but not calling `metricsCollector.recordBytes()`, resulting in missing throughput data.

**Solution**: Added `metricsCollector.recordBytes()` calls to the HttpProxy bidirectional forwarding callbacks.

### Metrics Architecture

The metrics system has multiple layers:

1. **Connection Records** (`record.bytesReceived/bytesSent`): Track total bytes per connection
2. **Global ThroughputTracker**: Accumulates bytes between samples for overall rate calculations
3. **Per-Route ThroughputTrackers**: Dedicated tracker for each route with per-second sampling
4. **Per-IP ThroughputTrackers**: Dedicated tracker for each IP with per-second sampling
5. **connectionByteTrackers**: Track cumulative bytes and metadata for active connections

Key features (see the sketch below):

- All throughput trackers sample every second (1 Hz)
- Each tracker maintains a circular buffer of samples (default: 1 hour retention)
- Rate calculations are accurate for any requested window (default: 1 second)
- All byte counting happens exactly once at the data flow point
- Unused route/IP trackers are automatically cleaned up when connections close
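As a rough illustration of the per-second sampling approach described above, here is a minimal TypeScript sketch. The class name `ThroughputTrackerSketch` and the `sample()`/`getRate()` methods are hypothetical stand-ins, not the actual SmartProxy implementation; only the general idea (1 Hz sampling, bounded retention, window-based rates instead of cumulative totals) mirrors this document.

```typescript
// Minimal sketch of a per-second sampling throughput tracker.
// Names other than the general "record bytes" pattern are illustrative.
class ThroughputTrackerSketch {
  private samples: { timestamp: number; in: number; out: number }[] = [];
  private accIn = 0;
  private accOut = 0;
  private readonly retentionMs = 60 * 60 * 1000; // keep ~1 hour of samples

  // Called from the data-flow callbacks as bytes move through the connection.
  recordBytes(bytesIn: number, bytesOut: number): void {
    this.accIn += bytesIn;
    this.accOut += bytesOut;
  }

  // Called once per second (1 Hz) by a sampling timer.
  sample(): void {
    const now = Date.now();
    this.samples.push({ timestamp: now, in: this.accIn, out: this.accOut });
    this.accIn = 0;
    this.accOut = 0;
    // Drop samples older than the retention window (circular-buffer behaviour).
    while (this.samples.length && now - this.samples[0].timestamp > this.retentionMs) {
      this.samples.shift();
    }
  }

  // Rate over the requested window: sum only the samples inside the window,
  // never the cumulative total since connection start (the drift bug above).
  getRate(windowSeconds = 1): { inBps: number; outBps: number } {
    const cutoff = Date.now() - windowSeconds * 1000;
    let inBytes = 0;
    let outBytes = 0;
    for (const s of this.samples) {
      if (s.timestamp >= cutoff) {
        inBytes += s.in;
        outBytes += s.out;
      }
    }
    return { inBps: inBytes / windowSeconds, outBps: outBytes / windowSeconds };
  }
}
```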
### Understanding "High" Byte Counts

If byte counts seem high compared to actual application data, remember:

- TLS handshakes can be 1-5KB depending on cipher suites and certificates
- Each TLS record has 5 bytes of header overhead
- TLS encryption adds 16-48 bytes of padding/MAC per record
- HTTP/2 has additional framing overhead
- WebSocket has frame headers (2-14 bytes per message)

This overhead is real network traffic and should be counted for accurate throughput metrics.

### Byte Counting Paths

There are two mutually exclusive paths for connections:

1. **Direct forwarding** (route-connection-handler.ts):
   - Used for TCP passthrough, TLS passthrough, and direct connections
   - Bytes counted in `setupBidirectionalForwarding` callbacks
   - Initial chunk NOT counted separately (flows through bidirectional forwarding)
2. **HttpProxy forwarding** (http-proxy-bridge.ts):
   - Used for TLS termination (terminate, terminate-and-reencrypt)
   - Initial chunk counted when written to proxy
   - All subsequent bytes counted in `setupBidirectionalForwarding` callbacks
   - This is the ONLY counting point for these connections

### Byte Counting Audit (2025-01-06)

A comprehensive audit was performed to verify byte counting accuracy.

**Audit Results:**

- ✅ No double counting detected in any connection flow
- ✅ Each byte counted exactly once in each direction
- ✅ Connection records and metrics updated consistently
- ✅ PROXY protocol headers correctly excluded from client metrics
- ✅ NFTables forwarded connections correctly not counted (the kernel handles them)

**Key Implementation Points:**

- All byte counting happens in only 2 files: `route-connection-handler.ts` and `http-proxy-bridge.ts`
- Both use the same pattern: increment `record.bytesReceived/Sent` AND call `metricsCollector.recordBytes()`
- Initial chunks handled correctly: stored but not counted until forwarded
- TLS alerts counted as sent bytes (correct - they are sent to the client)

For full audit details, see `readme.byte-counting-audit.md`.

## Connection Cleanup

### Zombie Connection Detection

The connection manager performs comprehensive zombie detection every 10 seconds:

- **Full zombies**: Both incoming and outgoing sockets destroyed but connection not cleaned up
- **Half zombies**: One socket destroyed, grace period expired (5 minutes for TLS, 30 seconds for non-TLS)
- **Stuck connections**: Data received but none sent back after threshold (5 minutes for TLS, 60 seconds for non-TLS)

### Cleanup Queue

Connections are cleaned up through a batched queue system:

- Batch size: 100 connections
- Processing triggered immediately when batch size is reached
- Otherwise processed after a 100ms delay
- Prevents overwhelming the system during mass disconnections

## Keep-Alive Handling

Keep-alive connections receive special treatment based on the `keepAliveTreatment` setting (see the sketch below):

- **standard**: Normal timeout applies
- **extended**: Timeout multiplied by `keepAliveInactivityMultiplier` (default 6x)
- **immortal**: No timeout, connections persist indefinitely
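A minimal sketch of how these treatments could translate into an effective inactivity timeout. The function name and signature are illustrative and not the actual SmartProxy API; only the three treatment values and the default 6x multiplier come from the description above.

```typescript
// Hypothetical helper mapping a keep-alive treatment to an effective timeout.
type KeepAliveTreatment = 'standard' | 'extended' | 'immortal';

function effectiveInactivityTimeout(
  baseTimeoutMs: number,
  treatment: KeepAliveTreatment,
  keepAliveInactivityMultiplier = 6, // default 6x, as described above
): number | null {
  if (treatment === 'immortal') return null; // no timeout: persists indefinitely
  if (treatment === 'extended') return baseTimeoutMs * keepAliveInactivityMultiplier;
  return baseTimeoutMs; // 'standard': the normal timeout applies
}

// Example: a 30s base timeout with the 'extended' treatment becomes 180s.
console.log(effectiveInactivityTimeout(30_000, 'extended')); // 180000
```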
## PROXY Protocol

The system supports both receiving and sending the PROXY protocol:

- **Receiving**: Automatically detected from trusted proxy IPs (configured in `proxyIPs`)
- **Sending**: Enabled per-route or globally via the `sendProxyProtocol` setting
- The real client IP is preserved and used for all connection tracking and security checks

## Metrics and Throughput Calculation

The metrics system tracks throughput using per-second sampling:

1. **Byte Recording**: Bytes are recorded as data flows through connections
2. **Sampling**: Every second, accumulated bytes are stored as a sample
3. **Rate Calculation**: Throughput is calculated by summing bytes over a time window
4. **Per-Route/IP Tracking**: Separate ThroughputTracker instances for each route and IP

Key implementation details:

- Bytes are recorded in the bidirectional forwarding callbacks
- The `instant()` method returns throughput over the last 1 second
- The `recent()` method returns throughput over the last 10 seconds
- Custom windows can be specified for different averaging periods

### Throughput Spikes Issue

There is a fundamental difference between application-layer and network-layer throughput:

**Application Layer (what we measure)**:

- Bytes are recorded when delivered to/from the application
- Large chunks can arrive "instantly" due to kernel/Node.js buffering
- Shows spikes when buffers are flushed (e.g., 20MB in 1 second = 160 Mbit/s)

**Network Layer (what Unifi shows)**:

- Actual packet flow through the network interface
- Limited by physical network speed (e.g., 20 Mbit/s)
- Data transfers over time, not in bursts

The spikes occur because:

1. Data flows over the network at 20 Mbit/s (taking 8 seconds for 20MB)
2. The kernel and Node.js buffer this incoming data
3. When the buffer is flushed, the application receives a large chunk at once
4. We record the entire chunk in the current second, creating an artificial spike

**Potential Solutions** (a worked example follows this list):

1. Use a longer window for "instant" measurements (e.g., 5 seconds instead of 1)
2. Track socket write backpressure to estimate actual network flow
3. Implement bandwidth estimation based on connection duration
4. Accept that application-layer throughput != network-layer throughput
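A quick worked example of why a flushed buffer looks like a spike at the application layer, and how a longer window (solution 1) smooths it. The helper function is illustrative only; decimal megabytes are used to match the figures above (20 MB = 160 Mbit).

```typescript
// A 20 MB chunk delivered "at once" after kernel/Node.js buffering.
const chunkBytes = 20 * 1000 * 1000;

// Convert bytes accumulated over a window into Mbit/s.
const mbitPerSecond = (bytes: number, windowSeconds: number): number =>
  (bytes * 8) / 1_000_000 / windowSeconds;

console.log(mbitPerSecond(chunkBytes, 1));  // 160 Mbit/s spike over a 1-second window
console.log(mbitPerSecond(chunkBytes, 10)); // 16 Mbit/s over a 10-second window, closer to the 20 Mbit/s link
```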
## Connection Limiting

### Per-IP Connection Limits

- SmartProxy tracks connections per IP address in the SecurityManager
- Default limit is 100 connections per IP (configurable via `maxConnectionsPerIP`)
- Connection rate limiting is also enforced (default 300 connections/minute per IP)
- HttpProxy has been enhanced to also enforce per-IP limits when forwarding from SmartProxy

### Route-Level Connection Limits

- Routes can define `security.maxConnections` to limit connections per route
- ConnectionManager tracks connections by route ID using a separate Map
- Limits are enforced in RouteConnectionHandler before forwarding
- The connection is tracked when the route is matched: `trackConnectionByRoute(routeId, connectionId)`

### HttpProxy Integration

- When SmartProxy forwards to HttpProxy for TLS termination, it sends a `CLIENT_IP:\r\n` header
- HttpProxy parses this header to track the real client IP, not the localhost IP
- This ensures per-IP limits are enforced even for forwarded connections
- The header is parsed in the connection handler before any data processing

### Memory Optimization

- Periodic cleanup runs every 60 seconds to remove:
  - IPs with no active connections
  - Expired rate limit timestamps (older than 1 minute)
- Prevents memory accumulation from many unique IPs over time
- Cleanup is automatic and runs in the background with `unref()` so it does not keep the process alive

### Connection Cleanup Queue

- The cleanup queue processes connections in batches to prevent overwhelming the system
- Race conditions are prevented using the `isProcessingCleanup` flag
- A try-finally block ensures the flag is always reset even if errors occur
- New connections added during processing are queued for the next batch

### Important Implementation Notes

- Always use the `NodeJS.Timeout` type instead of `NodeJS.Timer` for interval/timeout references
- IPv4/IPv6 normalization is handled (e.g., `::ffff:127.0.0.1` and `127.0.0.1` are treated as the same IP)
- Connection limits are checked before route matching to prevent DoS attacks
- SharedSecurityManager supports checking route-level limits via an optional parameter

## Log Deduplication

To reduce log spam during high-traffic scenarios or attacks, SmartProxy implements log deduplication for repetitive events.

### How It Works

- Similar log events are batched and aggregated over a 5-second window
- Instead of logging each event individually, a summary is emitted
- Events are grouped by type and deduplicated by key (e.g., IP address, reason)

### Deduplicated Event Types

1. **Connection Rejections** (`connection-rejected`):
   - Groups by rejection reason (global-limit, route-limit, etc.)
   - Example: "Rejected 150 connections (reasons: global-limit: 100, route-limit: 50)"
2. **IP Rejections** (`ip-rejected`):
   - Groups by IP address
   - Shows top offenders with rejection counts and reasons
   - Example: "Rejected 500 connections from 10 IPs (top offenders: 192.168.1.100 (200x, rate-limit), ...)"
3. **Connection Cleanups** (`connection-cleanup`):
   - Groups by cleanup reason (normal, timeout, error, zombie, etc.)
   - Example: "Cleaned up 250 connections (reasons: normal: 200, timeout: 30, error: 20)"
4. **IP Tracking Cleanup** (`ip-cleanup`):
   - Summarizes periodic IP cleanup operations
   - Example: "IP tracking cleanup: removed 50 entries across 5 cleanup cycles"
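A minimal sketch of the batching and grouping idea, assuming a simple map keyed by event type and deduplication key with a periodic flush. Class and method names are hypothetical and not taken from the SmartProxy source.

```typescript
// Hypothetical deduplicating logger: events are counted per (type, key) pair
// and emitted as one summary line per event type at flush time.
class DedupLogger {
  private counts = new Map<string, number>();
  private timer: NodeJS.Timeout;

  constructor(flushIntervalMs = 5_000) {
    this.timer = setInterval(() => this.flush(), flushIntervalMs);
    this.timer.unref(); // do not keep the process alive just for logging
  }

  // e.g. log('connection-rejected', 'global-limit') or log('ip-rejected', '192.168.1.100')
  log(eventType: string, key: string): void {
    const id = `${eventType}|${key}`;
    this.counts.set(id, (this.counts.get(id) ?? 0) + 1);
  }

  flush(): void {
    if (this.counts.size === 0) return;
    // Group counts back by event type and emit one summary per type.
    const byType = new Map<string, string[]>();
    for (const [id, count] of this.counts) {
      const [eventType, key] = id.split('|');
      const parts = byType.get(eventType) ?? [];
      parts.push(`${key}: ${count}`);
      byType.set(eventType, parts);
    }
    for (const [eventType, parts] of byType) {
      console.log(`[SUMMARY] ${eventType} (${parts.join(', ')})`);
    }
    this.counts.clear();
  }
}
```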
### Configuration

- Default flush interval: 5 seconds
- Maximum batch size: 100 events (triggers immediate flush)
- Global periodic flush: every 10 seconds (ensures logs are emitted regularly)
- Process exit handling: logs are flushed on SIGINT/SIGTERM

### Benefits

- Reduces log volume during attacks or high traffic
- Provides a better overview of patterns (e.g., which IPs are attacking)
- Improves log readability and analysis
- Prevents log storage overflow
- Maintains detailed information in aggregated form

### Log Output Examples

Instead of hundreds of individual logs:

```
Connection rejected
Connection rejected
Connection rejected
... (repeated 500 times)
```

You'll see:

```
[SUMMARY] Rejected 500 connections from 10 IPs in 5s (rate-limit: 350, per-ip-limit: 150) (top offenders: 192.168.1.100 (200x, rate-limit), 10.0.0.1 (150x, per-ip-limit))
```

Instead of:

```
Connection terminated: ::ffff:127.0.0.1 (client_closed). Active: 266
Connection terminated: ::ffff:127.0.0.1 (client_closed). Active: 265
... (repeated 266 times)
```

You'll see:

```
[SUMMARY] 266 HttpProxy connections terminated in 5s (reasons: client_closed: 266, activeConnections: 0)
```

### Rapid Event Handling

- During attacks or high-volume scenarios, logs are flushed more frequently
- If 50+ events occur within 1 second, an immediate flush is triggered
- Prevents memory buildup during flooding attacks
- Maintains real-time visibility during incidents
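A minimal sketch of the burst detection rule (50+ events within 1 second triggering an immediate flush). All names here are hypothetical; the real SmartProxy logger wires this into its own flush logic.

```typescript
// Hypothetical gate that detects event bursts and triggers an immediate flush.
class BurstFlushGate {
  private timestamps: number[] = [];

  constructor(
    private readonly burstLimit = 50,       // 50+ events ...
    private readonly burstWindowMs = 1_000, // ... within 1 second
    private readonly onBurst: () => void = () => {},
  ) {}

  // Call once per logged event; returns true when an immediate flush fired.
  recordEvent(now = Date.now()): boolean {
    this.timestamps.push(now);
    // Keep only timestamps inside the burst window.
    this.timestamps = this.timestamps.filter((t) => now - t <= this.burstWindowMs);
    if (this.timestamps.length >= this.burstLimit) {
      this.onBurst();
      this.timestamps = [];
      return true;
    }
    return false;
  }
}

// Usage (illustrative): const gate = new BurstFlushGate(50, 1000, () => dedupLogger.flush());
```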