push.rocks/smartproxy

Fork 0

Files

Juergen Kunz 2d2e9e9475 feat(certificates): add custom provisioning option

2025-07-13 00:27:49 +00:00

15 KiB

Raw Blame History

SmartProxy Development Hints

Byte Tracking and Metrics

Throughput Drift Issue (Fixed)

Problem: Throughput numbers were gradually increasing over time for long-lived connections.

Root Cause: The byRoute() and byIP() methods were dividing cumulative total bytes (since connection start) by the window duration, causing rates to appear higher as connections aged:

Hour 1: 1GB total / 60s = 17 MB/s ✓
Hour 2: 2GB total / 60s = 34 MB/s ✗ (appears doubled!)
Hour 3: 3GB total / 60s = 50 MB/s ✗ (keeps rising!)

Solution: Implemented dedicated ThroughputTracker instances for each route and IP address:

Each route and IP gets its own throughput tracker with per-second sampling
Samples are taken every second and stored in a circular buffer
Rate calculations use actual samples within the requested window
Default window is now 1 second for real-time accuracy

What Gets Counted (Network Interface Throughput)

The byte tracking is designed to match network interface throughput (what Unifi/network monitoring tools show):

Counted bytes include:

All application data
TLS handshakes and protocol overhead
TLS record headers and encryption padding
HTTP headers and protocol data
WebSocket frames and protocol overhead
TLS alerts sent to clients

NOT counted:

PROXY protocol headers (sent to backend, not client)
TCP/IP headers (handled by OS, not visible at application layer)

Byte direction:

bytesReceived: All bytes received FROM the client on the incoming connection
bytesSent: All bytes sent TO the client on the incoming connection
Backend connections are separate and not mixed with client metrics

Double Counting Issue (Fixed)

Problem: Initial data chunks were being counted twice in the byte tracking:

Once when stored in pendingData in setupDirectConnection()
Again when the data flowed through bidirectional forwarding

Solution: Removed the byte counting when storing initial chunks. Bytes are now only counted when they actually flow through the setupBidirectionalForwarding() callbacks.

HttpProxy Metrics (Fixed)

Problem: HttpProxy forwarding was updating connection record byte counts but not calling metricsCollector.recordBytes(), resulting in missing throughput data.

Solution: Added metricsCollector.recordBytes() calls to the HttpProxy bidirectional forwarding callbacks.

Metrics Architecture

The metrics system has multiple layers:

Connection Records (record.bytesReceived/bytesSent): Track total bytes per connection
Global ThroughputTracker: Accumulates bytes between samples for overall rate calculations
Per-Route ThroughputTrackers: Dedicated tracker for each route with per-second sampling
Per-IP ThroughputTrackers: Dedicated tracker for each IP with per-second sampling
connectionByteTrackers: Track cumulative bytes and metadata for active connections

Key features:

All throughput trackers sample every second (1Hz)
Each tracker maintains a circular buffer of samples (default: 1 hour retention)
Rate calculations are accurate for any requested window (default: 1 second)
All byte counting happens exactly once at the data flow point
Unused route/IP trackers are automatically cleaned up when connections close

Understanding "High" Byte Counts

If byte counts seem high compared to actual application data, remember:

TLS handshakes can be 1-5KB depending on cipher suites and certificates
Each TLS record has 5 bytes of header overhead
TLS encryption adds 16-48 bytes of padding/MAC per record
HTTP/2 has additional framing overhead
WebSocket has frame headers (2-14 bytes per message)

This overhead is real network traffic and should be counted for accurate throughput metrics.

Byte Counting Paths

There are two mutually exclusive paths for connections:

Direct forwarding (route-connection-handler.ts):
- Used for TCP passthrough, TLS passthrough, and direct connections
- Bytes counted in setupBidirectionalForwarding callbacks
- Initial chunk NOT counted separately (flows through bidirectional forwarding)
HttpProxy forwarding (http-proxy-bridge.ts):
- Used for TLS termination (terminate, terminate-and-reencrypt)
- Initial chunk counted when written to proxy
- All subsequent bytes counted in setupBidirectionalForwarding callbacks
- This is the ONLY counting point for these connections

Byte Counting Audit (2025-01-06)

A comprehensive audit was performed to verify byte counting accuracy:

Audit Results:

✅ No double counting detected in any connection flow
✅ Each byte counted exactly once in each direction
✅ Connection records and metrics updated consistently
✅ PROXY protocol headers correctly excluded from client metrics
✅ NFTables forwarded connections correctly not counted (kernel handles)

Key Implementation Points:

All byte counting happens in only 2 files: route-connection-handler.ts and http-proxy-bridge.ts
Both use the same pattern: increment record.bytesReceived/Sent AND call metricsCollector.recordBytes()
Initial chunks handled correctly: stored but not counted until forwarded
TLS alerts counted as sent bytes (correct - they are sent to client)

For full audit details, see readme.byte-counting-audit.md

Connection Cleanup

Zombie Connection Detection

The connection manager performs comprehensive zombie detection every 10 seconds:

Full zombies: Both incoming and outgoing sockets destroyed but connection not cleaned up
Half zombies: One socket destroyed, grace period expired (5 minutes for TLS, 30 seconds for non-TLS)
Stuck connections: Data received but none sent back after threshold (5 minutes for TLS, 60 seconds for non-TLS)

Cleanup Queue

Connections are cleaned up through a batched queue system:

Batch size: 100 connections
Processing triggered immediately when batch size reached
Otherwise processed after 100ms delay
Prevents overwhelming the system during mass disconnections

Keep-Alive Handling

Keep-alive connections receive special treatment based on keepAliveTreatment setting:

standard: Normal timeout applies
extended: Timeout multiplied by keepAliveInactivityMultiplier (default 6x)
immortal: No timeout, connections persist indefinitely

PROXY Protocol

The system supports both receiving and sending PROXY protocol:

Receiving: Automatically detected from trusted proxy IPs (configured in proxyIPs)
Sending: Enabled per-route or globally via sendProxyProtocol setting
Real client IP is preserved and used for all connection tracking and security checks

Metrics and Throughput Calculation

The metrics system tracks throughput using per-second sampling:

Byte Recording: Bytes are recorded as data flows through connections
Sampling: Every second, accumulated bytes are stored as a sample
Rate Calculation: Throughput is calculated by summing bytes over a time window
Per-Route/IP Tracking: Separate ThroughputTracker instances for each route and IP

Key implementation details:

Bytes are recorded in the bidirectional forwarding callbacks
The instant() method returns throughput over the last 1 second
The recent() method returns throughput over the last 10 seconds
Custom windows can be specified for different averaging periods

Throughput Spikes Issue

There's a fundamental difference between application-layer and network-layer throughput:

Application Layer (what we measure):

Bytes are recorded when delivered to/from the application
Large chunks can arrive "instantly" due to kernel/Node.js buffering
Shows spikes when buffers are flushed (e.g., 20MB in 1 second = 160 Mbit/s)

Network Layer (what Unifi shows):

Actual packet flow through the network interface
Limited by physical network speed (e.g., 20 Mbit/s)
Data transfers over time, not in bursts

The spikes occur because:

Data flows over network at 20 Mbit/s (takes 8 seconds for 20MB)
Kernel/Node.js buffers this incoming data
When buffer is flushed, application receives large chunk at once
We record entire chunk in current second, creating artificial spike

Potential Solutions:

Use longer window for "instant" measurements (e.g., 5 seconds instead of 1)
Track socket write backpressure to estimate actual network flow
Implement bandwidth estimation based on connection duration
Accept that application-layer != network-layer throughput

Connection Limiting

Per-IP Connection Limits

SmartProxy tracks connections per IP address in the SecurityManager
Default limit is 100 connections per IP (configurable via maxConnectionsPerIP)
Connection rate limiting is also enforced (default 300 connections/minute per IP)
HttpProxy has been enhanced to also enforce per-IP limits when forwarding from SmartProxy

Route-Level Connection Limits

Routes can define security.maxConnections to limit connections per route
ConnectionManager tracks connections by route ID using a separate Map
Limits are enforced in RouteConnectionHandler before forwarding
Connection is tracked when route is matched: trackConnectionByRoute(routeId, connectionId)

HttpProxy Integration

When SmartProxy forwards to HttpProxy for TLS termination, it sends a CLIENT_IP:<ip>\r\n header
HttpProxy parses this header to track the real client IP, not the localhost IP
This ensures per-IP limits are enforced even for forwarded connections
The header is parsed in the connection handler before any data processing

Memory Optimization

Periodic cleanup runs every 60 seconds to remove:
- IPs with no active connections
- Expired rate limit timestamps (older than 1 minute)
Prevents memory accumulation from many unique IPs over time
Cleanup is automatic and runs in background with unref() to not keep process alive

Connection Cleanup Queue

Cleanup queue processes connections in batches to prevent overwhelming the system
Race condition prevention using isProcessingCleanup flag
Try-finally block ensures flag is always reset even if errors occur
New connections added during processing are queued for next batch

Important Implementation Notes

Always use NodeJS.Timeout type instead of NodeJS.Timer for interval/timeout references
IPv4/IPv6 normalization is handled (e.g., ::ffff:127.0.0.1 and 127.0.0.1 are treated as the same IP)
Connection limits are checked before route matching to prevent DoS attacks
SharedSecurityManager supports checking route-level limits via optional parameter

Log Deduplication

To reduce log spam during high-traffic scenarios or attacks, SmartProxy implements log deduplication for repetitive events:

How It Works

Similar log events are batched and aggregated over a 5-second window
Instead of logging each event individually, a summary is emitted
Events are grouped by type and deduplicated by key (e.g., IP address, reason)

Deduplicated Event Types

Connection Rejections (connection-rejected):
- Groups by rejection reason (global-limit, route-limit, etc.)
- Example: "Rejected 150 connections (reasons: global-limit: 100, route-limit: 50)"
IP Rejections (ip-rejected):
- Groups by IP address
- Shows top offenders with rejection counts and reasons
- Example: "Rejected 500 connections from 10 IPs (top offenders: 192.168.1.100 (200x, rate-limit), ...)"
Connection Cleanups (connection-cleanup):
- Groups by cleanup reason (normal, timeout, error, zombie, etc.)
- Example: "Cleaned up 250 connections (reasons: normal: 200, timeout: 30, error: 20)"
IP Tracking Cleanup (ip-cleanup):
- Summarizes periodic IP cleanup operations
- Example: "IP tracking cleanup: removed 50 entries across 5 cleanup cycles"

Configuration

Default flush interval: 5 seconds
Maximum batch size: 100 events (triggers immediate flush)
Global periodic flush: Every 10 seconds (ensures logs are emitted regularly)
Process exit handling: Logs are flushed on SIGINT/SIGTERM

Benefits

Reduces log volume during attacks or high traffic
Provides better overview of patterns (e.g., which IPs are attacking)
Improves log readability and analysis
Prevents log storage overflow
Maintains detailed information in aggregated form

Log Output Examples

Instead of hundreds of individual logs:

Connection rejected
Connection rejected
Connection rejected
... (repeated 500 times)

You'll see:

[SUMMARY] Rejected 500 connections from 10 IPs in 5s (rate-limit: 350, per-ip-limit: 150) (top offenders: 192.168.1.100 (200x, rate-limit), 10.0.0.1 (150x, per-ip-limit))

Instead of:

Connection terminated: ::ffff:127.0.0.1 (client_closed). Active: 266
Connection terminated: ::ffff:127.0.0.1 (client_closed). Active: 265
... (repeated 266 times)