SmartProxy Development Hints

Byte Tracking and Metrics

Throughput Drift Issue (Fixed)

Problem: Throughput numbers were gradually increasing over time for long-lived connections.

Root Cause: The byRoute() and byIP() methods were dividing cumulative total bytes (since connection start) by the window duration, causing rates to appear higher as connections aged:

  • Hour 1: 1GB total / 60s = 17 MB/s ✓
  • Hour 2: 2GB total / 60s = 34 MB/s ✗ (appears doubled!)
  • Hour 3: 3GB total / 60s = 50 MB/s ✗ (keeps rising!)

Solution: Implemented dedicated ThroughputTracker instances for each route and IP address:

  • Each route and IP gets its own throughput tracker with per-second sampling
  • Samples are taken every second and stored in a circular buffer
  • Rate calculations use actual samples within the requested window
  • Default window is now 1 second for real-time accuracy
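
A minimal sketch of such a tracker, assuming illustrative names (ThroughputTracker, recordBytes, takeSample, getRate) and a simple array-backed circular buffer; the actual implementation may differ:

```typescript
// Sketch of a per-second sampling throughput tracker (illustrative names).
interface IThroughputSample {
  timestamp: number;   // ms since epoch when the sample was taken
  bytesIn: number;     // bytes received during this 1s interval
  bytesOut: number;    // bytes sent during this 1s interval
}

class ThroughputTracker {
  private samples: IThroughputSample[] = []; // circular buffer of 1s samples
  private pendingIn = 0;
  private pendingOut = 0;
  private readonly maxSamples = 3600;        // ~1 hour of retention at 1 Hz

  // Called from the data-flow callbacks as bytes are forwarded.
  public recordBytes(bytesIn: number, bytesOut: number): void {
    this.pendingIn += bytesIn;
    this.pendingOut += bytesOut;
  }

  // Called once per second by a sampling timer.
  public takeSample(): void {
    this.samples.push({ timestamp: Date.now(), bytesIn: this.pendingIn, bytesOut: this.pendingOut });
    this.pendingIn = 0;
    this.pendingOut = 0;
    if (this.samples.length > this.maxSamples) {
      this.samples.shift();                  // drop the oldest sample
    }
  }

  // Rate over the last `windowSeconds`, using only samples inside that window.
  public getRate(windowSeconds = 1): { in: number; out: number } {
    const cutoff = Date.now() - windowSeconds * 1000;
    const recent = this.samples.filter((s) => s.timestamp >= cutoff);
    const totalIn = recent.reduce((sum, s) => sum + s.bytesIn, 0);
    const totalOut = recent.reduce((sum, s) => sum + s.bytesOut, 0);
    return { in: totalIn / windowSeconds, out: totalOut / windowSeconds };
  }
}
```

Because each sample covers exactly one second of traffic, summing the samples inside a window cannot inflate as a connection ages, which is what eliminates the drift described above.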

What Gets Counted (Network Interface Throughput)

The byte tracking is designed to match network interface throughput (what Unifi/network monitoring tools show):

Counted bytes include:

  • All application data
  • TLS handshakes and protocol overhead
  • TLS record headers and encryption padding
  • HTTP headers and protocol data
  • WebSocket frames and protocol overhead
  • TLS alerts sent to clients

NOT counted:

  • PROXY protocol headers (sent to backend, not client)
  • TCP/IP headers (handled by OS, not visible at application layer)

Byte direction:

  • bytesReceived: All bytes received FROM the client on the incoming connection
  • bytesSent: All bytes sent TO the client on the incoming connection
  • Backend connections are separate and not mixed with client metrics

Double Counting Issue (Fixed)

Problem: Initial data chunks were being counted twice in the byte tracking:

  1. Once when stored in pendingData in setupDirectConnection()
  2. Again when the data flowed through bidirectional forwarding

Solution: Removed the byte counting when storing initial chunks. Bytes are now only counted when they actually flow through the setupBidirectionalForwarding() callbacks.
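
A hedged sketch of the pattern, with illustrative names for the forwarding helper and the metrics callback (the real setupBidirectionalForwarding and metricsCollector.recordBytes signatures may differ; backpressure handling is omitted):

```typescript
import * as net from 'net';

// Sketch: count bytes only where data actually flows (illustrative signatures).
function forwardWithCounting(
  clientSocket: net.Socket,
  backendSocket: net.Socket,
  record: { bytesReceived: number; bytesSent: number },
  recordBytes: (bytesIn: number, bytesOut: number) => void,
): void {
  clientSocket.on('data', (chunk: Buffer) => {
    // Bytes received FROM the client on the incoming connection.
    record.bytesReceived += chunk.length;
    recordBytes(chunk.length, 0);
    backendSocket.write(chunk);
  });
  backendSocket.on('data', (chunk: Buffer) => {
    // Bytes sent TO the client on the incoming connection.
    record.bytesSent += chunk.length;
    recordBytes(0, chunk.length);
    clientSocket.write(chunk);
  });
  // The initial chunk is only buffered elsewhere (pendingData) and NOT counted
  // when stored; it is counted here once it flows through the forwarding path.
}
```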

HttpProxy Metrics (Fixed)

Problem: HttpProxy forwarding was updating connection record byte counts but not calling metricsCollector.recordBytes(), resulting in missing throughput data.

Solution: Added metricsCollector.recordBytes() calls to the HttpProxy bidirectional forwarding callbacks.

Metrics Architecture

The metrics system has multiple layers:

  1. Connection Records (record.bytesReceived/bytesSent): Track total bytes per connection
  2. Global ThroughputTracker: Accumulates bytes between samples for overall rate calculations
  3. Per-Route ThroughputTrackers: Dedicated tracker for each route with per-second sampling
  4. Per-IP ThroughputTrackers: Dedicated tracker for each IP with per-second sampling
  5. connectionByteTrackers: Track cumulative bytes and metadata for active connections

Key features:

  • All throughput trackers sample every second (1Hz)
  • Each tracker maintains a circular buffer of samples (default: 1 hour retention)
  • Rate calculations are accurate for any requested window (default: 1 second)
  • All byte counting happens exactly once at the data flow point
  • Unused route/IP trackers are automatically cleaned up when connections close
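
A sketch of how the per-route and per-IP trackers might be kept and pruned, reusing the ThroughputTracker sketch above; the class and method names here are assumptions:

```typescript
// Sketch: one dedicated tracker per route and per IP (illustrative names).
class PerKeyThroughput {
  private routeTrackers = new Map<string, ThroughputTracker>();
  private ipTrackers = new Map<string, ThroughputTracker>();

  public forRoute(routeId: string): ThroughputTracker {
    let tracker = this.routeTrackers.get(routeId);
    if (!tracker) {
      tracker = new ThroughputTracker();
      this.routeTrackers.set(routeId, tracker);
    }
    return tracker;
  }

  // Called when connections close: drop trackers with no remaining connections.
  public prune(activeRouteIds: Set<string>, activeIPs: Set<string>): void {
    for (const routeId of this.routeTrackers.keys()) {
      if (!activeRouteIds.has(routeId)) this.routeTrackers.delete(routeId);
    }
    for (const ip of this.ipTrackers.keys()) {
      if (!activeIPs.has(ip)) this.ipTrackers.delete(ip);
    }
  }
}
```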

Understanding "High" Byte Counts

If byte counts seem high compared to actual application data, remember:

  • TLS handshakes can be 1-5KB depending on cipher suites and certificates
  • Each TLS record has 5 bytes of header overhead
  • TLS encryption adds 16-48 bytes of padding/MAC per record
  • HTTP/2 has additional framing overhead
  • WebSocket has frame headers (2-14 bytes per message)

This overhead is real network traffic and should be counted for accurate throughput metrics.
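
As a rough worked example (illustrative numbers): sending 1 MB of application data in 16 KB TLS records produces about 64 records, so roughly 64 × (5 + ~30) ≈ 2 KB of record-header and MAC/padding overhead on top of the 1-5 KB handshake, all of which appears in the byte counts.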

Byte Counting Paths

There are two mutually exclusive paths for connections:

  1. Direct forwarding (route-connection-handler.ts):

    • Used for TCP passthrough, TLS passthrough, and direct connections
    • Bytes counted in setupBidirectionalForwarding callbacks
    • Initial chunk NOT counted separately (flows through bidirectional forwarding)
  2. HttpProxy forwarding (http-proxy-bridge.ts):

    • Used for TLS termination (terminate, terminate-and-reencrypt)
    • Initial chunk counted when written to proxy
    • All subsequent bytes counted in setupBidirectionalForwarding callbacks
    • This is the ONLY counting point for these connections

Byte Counting Audit (2025-01-06)

A comprehensive audit was performed to verify byte counting accuracy:

Audit Results:

  • No double counting detected in any connection flow
  • Each byte counted exactly once in each direction
  • Connection records and metrics updated consistently
  • PROXY protocol headers correctly excluded from client metrics
  • NFTables-forwarded connections correctly not counted (forwarding is handled by the kernel)

Key Implementation Points:

  • All byte counting happens in only 2 files: route-connection-handler.ts and http-proxy-bridge.ts
  • Both use the same pattern: increment record.bytesReceived/Sent AND call metricsCollector.recordBytes()
  • Initial chunks handled correctly: stored but not counted until forwarded
  • TLS alerts counted as sent bytes (correct - they are sent to client)

For full audit details, see readme.byte-counting-audit.md

Connection Cleanup

Zombie Connection Detection

The connection manager performs comprehensive zombie detection every 10 seconds:

  • Full zombies: Both incoming and outgoing sockets destroyed but connection not cleaned up
  • Half zombies: One socket destroyed, grace period expired (5 minutes for TLS, 30 seconds for non-TLS)
  • Stuck connections: Data received but none sent back after threshold (5 minutes for TLS, 60 seconds for non-TLS)
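
A sketch of one detection pass, with record fields and the cleanup call named illustratively (the real connection manager's types may differ):

```typescript
// Sketch of a single zombie-detection pass (field names are illustrative).
interface IConnectionRecord {
  isTLS: boolean;
  incoming?: { destroyed: boolean };
  outgoing?: { destroyed: boolean };
  lastActivity: number;   // ms timestamp of last data in either direction
  connectedAt: number;    // ms timestamp when the connection was established
  bytesReceived: number;
  bytesSent: number;
}

function checkForZombies(
  records: Map<string, IConnectionRecord>,
  cleanup: (id: string, reason: string) => void,
): void {
  const now = Date.now();
  for (const [id, record] of records) {
    const halfZombieGrace = record.isTLS ? 5 * 60_000 : 30_000; // 5 min TLS, 30 s non-TLS
    const stuckThreshold = record.isTLS ? 5 * 60_000 : 60_000;  // 5 min TLS, 60 s non-TLS
    const incomingDead = record.incoming?.destroyed === true;
    const outgoingDead = record.outgoing?.destroyed === true;

    if (incomingDead && outgoingDead) {
      cleanup(id, 'zombie');                                    // full zombie
    } else if ((incomingDead || outgoingDead) && now - record.lastActivity > halfZombieGrace) {
      cleanup(id, 'half-zombie');                               // half zombie past grace period
    } else if (record.bytesReceived > 0 && record.bytesSent === 0 && now - record.connectedAt > stuckThreshold) {
      cleanup(id, 'stuck');                                     // data received but nothing sent back
    }
  }
}
```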

Cleanup Queue

Connections are cleaned up through a batched queue system:

  • Batch size: 100 connections
  • Processing triggered immediately when batch size reached
  • Otherwise processed after 100ms delay
  • Prevents overwhelming the system during mass disconnections
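
A sketch of the batching trigger under those assumptions (class and method names are illustrative):

```typescript
// Sketch of the batched cleanup queue trigger (batch size and delay from the list above).
class CleanupQueue {
  private queue = new Set<string>();
  private timer: NodeJS.Timeout | null = null;
  private readonly batchSize = 100;

  constructor(private processBatch: (ids: string[]) => void) {}

  public add(connectionId: string): void {
    this.queue.add(connectionId);
    if (this.queue.size >= this.batchSize) {
      this.flush();                                     // immediate processing at batch size
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), 100); // otherwise after a 100 ms delay
    }
  }

  private flush(): void {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    const batch = [...this.queue];
    this.queue.clear();
    this.processBatch(batch);
  }
}
```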

Keep-Alive Handling

Keep-alive connections receive special treatment based on the keepAliveTreatment setting:

  • standard: Normal timeout applies
  • extended: Timeout multiplied by keepAliveInactivityMultiplier (default 6x)
  • immortal: No timeout, connections persist indefinitely
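
A small sketch of how the effective timeout could be derived from these settings (function and type names are illustrative):

```typescript
// Sketch: effective inactivity timeout based on keepAliveTreatment (illustrative names).
type KeepAliveTreatment = 'standard' | 'extended' | 'immortal';

function effectiveTimeout(
  baseTimeoutMs: number,
  treatment: KeepAliveTreatment,
  keepAliveInactivityMultiplier = 6,     // default multiplier from the list above
): number | null {
  switch (treatment) {
    case 'immortal': return null;                                          // never times out
    case 'extended': return baseTimeoutMs * keepAliveInactivityMultiplier; // default 6x
    case 'standard':
    default:         return baseTimeoutMs;                                 // normal timeout
  }
}
```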

PROXY Protocol

The system supports both receiving and sending PROXY protocol:

  • Receiving: Automatically detected from trusted proxy IPs (configured in proxyIPs)
  • Sending: Enabled per-route or globally via sendProxyProtocol setting
  • Real client IP is preserved and used for all connection tracking and security checks
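
A sketch of building a PROXY protocol v1 header for the backend side; the helper name is an assumption, but the v1 text format shown is the standard one:

```typescript
import * as net from 'net';

// Sketch: building a PROXY protocol v1 header for the backend connection
// (illustrative helper; the actual SmartProxy code may differ).
function buildProxyV1Header(clientSocket: net.Socket): string {
  const srcIp = clientSocket.remoteAddress ?? '0.0.0.0';
  const dstIp = clientSocket.localAddress ?? '0.0.0.0';
  const family = net.isIPv6(srcIp) ? 'TCP6' : 'TCP4';
  return `PROXY ${family} ${srcIp} ${dstIp} ${clientSocket.remotePort} ${clientSocket.localPort}\r\n`;
}

// When sendProxyProtocol is enabled, this header is written to the backend socket
// before any client data is forwarded; it is never counted toward client metrics.
```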

Metrics and Throughput Calculation

The metrics system tracks throughput using per-second sampling:

  1. Byte Recording: Bytes are recorded as data flows through connections
  2. Sampling: Every second, accumulated bytes are stored as a sample
  3. Rate Calculation: Throughput is calculated by summing bytes over a time window
  4. Per-Route/IP Tracking: Separate ThroughputTracker instances for each route and IP

Key implementation details:

  • Bytes are recorded in the bidirectional forwarding callbacks
  • The instant() method returns throughput over the last 1 second
  • The recent() method returns throughput over the last 10 seconds
  • Custom windows can be specified for different averaging periods
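
A hedged usage sketch; instant(), recent(), byRoute(), and byIP() come from the descriptions above, while the interface shape, the window parameters, and getRate() for custom windows are assumptions:

```typescript
// Illustrative shape of the throughput API (not the confirmed SmartProxy interface).
interface IThroughput {
  instant(): { in: number; out: number };                        // last 1 second
  recent(): { in: number; out: number };                         // last 10 seconds
  getRate(windowSeconds: number): { in: number; out: number };   // hypothetical custom window
  byRoute(windowSeconds?: number): Map<string, { in: number; out: number }>;
  byIP(windowSeconds?: number): Map<string, { in: number; out: number }>;
}

function logThroughput(throughput: IThroughput): void {
  console.log('instant:', throughput.instant());                 // real-time (1 s window)
  console.log('recent:', throughput.recent());                   // smoothed (10 s window)
  console.log('per route (60 s):', throughput.byRoute(60));      // per-route averages
}
```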

Throughput Spikes Issue

There's a fundamental difference between application-layer and network-layer throughput:

Application Layer (what we measure):

  • Bytes are recorded when delivered to/from the application
  • Large chunks can arrive "instantly" due to kernel/Node.js buffering
  • Shows spikes when buffers are flushed (e.g., 20MB in 1 second = 160 Mbit/s)

Network Layer (what Unifi shows):

  • Actual packet flow through the network interface
  • Limited by physical network speed (e.g., 20 Mbit/s)
  • Data transfers over time, not in bursts

The spikes occur because:

  1. Data flows over the network at 20 Mbit/s (taking 8 seconds for 20MB)
  2. The kernel and Node.js buffer this incoming data
  3. When the buffer is flushed, the application receives a large chunk at once
  4. We record the entire chunk in the current second, creating an artificial spike

Potential Solutions:

  1. Use a longer window for "instant" measurements (e.g., 5 seconds instead of 1)
  2. Track socket write backpressure to estimate actual network flow
  3. Implement bandwidth estimation based on connection duration
  4. Accept that application-layer != network-layer throughput

Connection Limiting

Per-IP Connection Limits

  • SmartProxy tracks connections per IP address in the SecurityManager
  • Default limit is 100 connections per IP (configurable via maxConnectionsPerIP)
  • Connection rate limiting is also enforced (default 300 connections/minute per IP)
  • HttpProxy has been enhanced to also enforce per-IP limits when forwarding from SmartProxy
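
A sketch of the per-IP checks using the defaults above (class and field names are illustrative, not the actual SecurityManager API):

```typescript
// Sketch of per-IP connection and rate limiting (defaults from the list above).
class IpConnectionLimiter {
  private activeByIP = new Map<string, number>();    // ip -> active connection count
  private timesByIP = new Map<string, number[]>();   // ip -> recent connection timestamps

  constructor(
    private maxConnectionsPerIP = 100,               // per-IP connection limit
    private maxConnectionsPerMinute = 300,           // per-IP connection rate limit
  ) {}

  public allow(ip: string): boolean {
    if ((this.activeByIP.get(ip) ?? 0) >= this.maxConnectionsPerIP) return false;
    const now = Date.now();
    const recent = (this.timesByIP.get(ip) ?? []).filter((t) => now - t < 60_000);
    if (recent.length >= this.maxConnectionsPerMinute) return false;
    recent.push(now);
    this.timesByIP.set(ip, recent);
    return true;
  }

  public track(ip: string): void {
    this.activeByIP.set(ip, (this.activeByIP.get(ip) ?? 0) + 1);
  }

  public untrack(ip: string): void {
    const n = (this.activeByIP.get(ip) ?? 1) - 1;
    if (n <= 0) this.activeByIP.delete(ip); else this.activeByIP.set(ip, n);
  }
}
```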

Route-Level Connection Limits

  • Routes can define security.maxConnections to limit connections per route
  • ConnectionManager tracks connections by route ID using a separate Map
  • Limits are enforced in RouteConnectionHandler before forwarding
  • Connection is tracked when route is matched: trackConnectionByRoute(routeId, connectionId)

HttpProxy Integration

  • When SmartProxy forwards to HttpProxy for TLS termination, it sends a CLIENT_IP:<ip>\r\n header
  • HttpProxy parses this header to track the real client IP, not the localhost IP
  • This ensures per-IP limits are enforced even for forwarded connections
  • The header is parsed in the connection handler before any data processing
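
A sketch of parsing that header on the HttpProxy side (function name and return shape are assumptions; the header itself is ASCII, so string offsets map directly to buffer offsets):

```typescript
// Sketch: extracting the real client IP from the CLIENT_IP:<ip>\r\n prefix (illustrative).
function extractClientIp(firstChunk: Buffer): { realIp: string | null; rest: Buffer } {
  const text = firstChunk.toString('utf8');
  if (text.startsWith('CLIENT_IP:')) {
    const end = text.indexOf('\r\n');
    if (end !== -1) {
      const realIp = text.slice('CLIENT_IP:'.length, end);
      return { realIp, rest: firstChunk.subarray(end + 2) };  // strip header before processing
    }
  }
  return { realIp: null, rest: firstChunk };                  // no header: fall back to socket remoteAddress
}
```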

Memory Optimization

  • Periodic cleanup runs every 60 seconds to remove:
    • IPs with no active connections
    • Expired rate limit timestamps (older than 1 minute)
  • Prevents memory accumulation from many unique IPs over time
  • Cleanup is automatic and runs in background with unref() to not keep process alive
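
A sketch of the periodic cleanup under those assumptions (data structure names are illustrative):

```typescript
// Sketch of the periodic IP tracking cleanup (interval and retention from the list above).
interface IIpTrackingInfo {
  activeConnections: number;
  timestamps: number[];        // connection attempt times used for rate limiting
}

const ipTracking = new Map<string, IIpTrackingInfo>();

const ipCleanupInterval = setInterval(() => {
  const cutoff = Date.now() - 60_000;      // drop rate-limit timestamps older than 1 minute
  for (const [ip, info] of ipTracking) {
    info.timestamps = info.timestamps.filter((t) => t >= cutoff);
    if (info.activeConnections === 0 && info.timestamps.length === 0) {
      ipTracking.delete(ip);               // remove IPs with no active connections
    }
  }
}, 60_000);
ipCleanupInterval.unref();                 // don't keep the process alive just for cleanup
```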

Connection Cleanup Queue

  • Cleanup queue processes connections in batches to prevent overwhelming the system
  • Race condition prevention using isProcessingCleanup flag
  • Try-finally block ensures flag is always reset even if errors occur
  • New connections added during processing are queued for next batch
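
A sketch of the guard flag with try-finally (the surrounding class is illustrative):

```typescript
// Sketch of the race-condition guard (flag name from the description; structure illustrative).
class CleanupProcessor {
  private cleanupQueue = new Set<string>();
  private isProcessingCleanup = false;

  constructor(private cleanupConnection: (id: string) => void) {}

  public processCleanupQueue(): void {
    if (this.isProcessingCleanup) return;      // a run is already in progress
    this.isProcessingCleanup = true;
    try {
      const batch = [...this.cleanupQueue].slice(0, 100);
      for (const id of batch) {
        this.cleanupQueue.delete(id);
        this.cleanupConnection(id);
      }
    } finally {
      this.isProcessingCleanup = false;        // always reset, even if cleanup throws
      // Connections added while processing remain queued for the next batch.
    }
  }
}
```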

Important Implementation Notes

  • Always use NodeJS.Timeout type instead of NodeJS.Timer for interval/timeout references
  • IPv4/IPv6 normalization is handled (e.g., ::ffff:127.0.0.1 and 127.0.0.1 are treated as the same IP)
  • Connection limits are checked before route matching to prevent DoS attacks
  • SharedSecurityManager supports checking route-level limits via optional parameter
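
A small sketch of the IPv4/IPv6 normalization (the helper name is an assumption):

```typescript
// Sketch: normalize IPv4-mapped IPv6 addresses before tracking and limit checks.
function normalizeIP(ip: string): string {
  // '::ffff:127.0.0.1' and '127.0.0.1' should count as the same client.
  return ip.startsWith('::ffff:') ? ip.slice('::ffff:'.length) : ip;
}
```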

Log Deduplication

To reduce log spam during high-traffic scenarios or attacks, SmartProxy implements log deduplication for repetitive events:

How It Works

  • Similar log events are batched and aggregated over a 5-second window
  • Instead of logging each event individually, a summary is emitted
  • Events are grouped by type and deduplicated by key (e.g., IP address, reason)
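
A sketch of the aggregation, assuming illustrative names and a simple console sink; the real logger integration differs:

```typescript
// Sketch of log deduplication: batch similar events and emit one summary per flush.
interface IDedupEntry { count: number; reasons: Map<string, number>; }

class LogDeduplicator {
  private batches = new Map<string, Map<string, IDedupEntry>>();  // eventType -> key -> counts
  private flushTimer: NodeJS.Timeout;

  constructor(private flushIntervalMs = 5000, private maxBatchSize = 100) {
    this.flushTimer = setInterval(() => this.flush(), flushIntervalMs);
    this.flushTimer.unref();
  }

  public record(eventType: string, key: string, reason: string): void {
    const byKey = this.batches.get(eventType) ?? new Map<string, IDedupEntry>();
    const entry = byKey.get(key) ?? { count: 0, reasons: new Map<string, number>() };
    entry.count++;
    entry.reasons.set(reason, (entry.reasons.get(reason) ?? 0) + 1);
    byKey.set(key, entry);
    this.batches.set(eventType, byKey);

    const total = [...byKey.values()].reduce((sum, e) => sum + e.count, 0);
    if (total >= this.maxBatchSize) this.flush();                 // immediate flush at batch size
  }

  private flush(): void {
    for (const [eventType, byKey] of this.batches) {
      const total = [...byKey.values()].reduce((sum, e) => sum + e.count, 0);
      console.log(`[SUMMARY] ${eventType}: ${total} events from ${byKey.size} keys`);
    }
    this.batches.clear();
  }
}
```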

Deduplicated Event Types

  1. Connection Rejections (connection-rejected):

    • Groups by rejection reason (global-limit, route-limit, etc.)
    • Example: "Rejected 150 connections (reasons: global-limit: 100, route-limit: 50)"
  2. IP Rejections (ip-rejected):

    • Groups by IP address
    • Shows top offenders with rejection counts and reasons
    • Example: "Rejected 500 connections from 10 IPs (top offenders: 192.168.1.100 (200x, rate-limit), ...)"
  3. Connection Cleanups (connection-cleanup):

    • Groups by cleanup reason (normal, timeout, error, zombie, etc.)
    • Example: "Cleaned up 250 connections (reasons: normal: 200, timeout: 30, error: 20)"
  4. IP Tracking Cleanup (ip-cleanup):

    • Summarizes periodic IP cleanup operations
    • Example: "IP tracking cleanup: removed 50 entries across 5 cleanup cycles"

Configuration

  • Default flush interval: 5 seconds
  • Maximum batch size: 100 events (triggers immediate flush)
  • Global periodic flush: Every 10 seconds (ensures logs are emitted regularly)
  • Process exit handling: Logs are flushed on SIGINT/SIGTERM

Benefits

  • Reduces log volume during attacks or high traffic
  • Provides better overview of patterns (e.g., which IPs are attacking)
  • Improves log readability and analysis
  • Prevents log storage overflow
  • Maintains detailed information in aggregated form

Log Output Examples

Instead of hundreds of individual logs:

Connection rejected
Connection rejected
Connection rejected
... (repeated 500 times)

You'll see:

[SUMMARY] Rejected 500 connections from 10 IPs in 5s (rate-limit: 350, per-ip-limit: 150) (top offenders: 192.168.1.100 (200x, rate-limit), 10.0.0.1 (150x, per-ip-limit))

Instead of:

Connection terminated: ::ffff:127.0.0.1 (client_closed). Active: 266
Connection terminated: ::ffff:127.0.0.1 (client_closed). Active: 265
... (repeated 266 times)

You'll see:

[SUMMARY] 266 HttpProxy connections terminated in 5s (reasons: client_closed: 266, activeConnections: 0)

Rapid Event Handling

  • During attacks or high-volume scenarios, logs are flushed more frequently
  • If 50+ events occur within 1 second, immediate flush is triggered
  • Prevents memory buildup during flooding attacks
  • Maintains real-time visibility during incidents