# SmartProxy Metrics Improvement Plan

## Overview

The current `getThroughputRate()` implementation returns the cumulative byte count over a 60-second window rather than a per-second rate, which makes the metric misleading for monitoring systems. This plan outlines a redesign of the metrics system to provide accurate, time-series-based metrics suitable for production monitoring.

## 1. Core Issues with Current Implementation

- **Cumulative vs Rate**: Current method accumulates all bytes from connections in the last minute rather than calculating actual throughput rate
- **No Time-Series Data**: Cannot track throughput changes over time
- **Inaccurate Estimates**: Attempting to estimate rates for older connections is fundamentally flawed
- **No Sliding Windows**: Cannot provide different time window views (1s, 10s, 60s, etc.)
- **Limited Granularity**: Only provides a single 60-second view

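To make the first issue concrete, the sketch below contrasts a cumulative byte count with a per-second rate for the same traffic. The traffic figures are hypothetical and chosen only to keep the arithmetic easy to follow.

```typescript
// Hypothetical traffic: a steady 1,000,000 bytes per second for one minute.
const windowSeconds = 60;
const perSecondBytes: number[] = new Array<number>(windowSeconds).fill(1000000);

// What a cumulative implementation reports: every byte seen in the window.
const cumulativeBytes = perSecondBytes.reduce((sum, b) => sum + b, 0);

// What a monitoring system expects: bytes divided by the window length.
const bytesPerSecond = cumulativeBytes / windowSeconds;

console.log(`cumulative: ${cumulativeBytes} bytes`); // 60000000
console.log(`rate: ${bytesPerSecond} bytes/sec`);    // 1000000
```

If the cumulative figure is read as a rate, it overstates throughput by a factor equal to the window length (60 in this example).
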
## 2. Proposed Architecture

### A. Time-Series Throughput Tracking

```typescript
interface IThroughputSample {
  timestamp: number; // When the sample was taken (ms since epoch)
  bytesIn: number;   // Bytes received since the previous sample
  bytesOut: number;  // Bytes sent since the previous sample
}

class ThroughputTracker {
  private samples: IThroughputSample[] = [];
  private readonly MAX_SAMPLES = 3600; // 1 hour at 1 sample/second
  private lastSampleTime: number = 0;
  private accumulatedBytesIn: number = 0;
  private accumulatedBytesOut: number = 0;

  // Called on every data transfer
  public recordBytes(bytesIn: number, bytesOut: number): void {
    this.accumulatedBytesIn += bytesIn;
    this.accumulatedBytesOut += bytesOut;
  }

  // Called periodically (every second)
  public takeSample(): void {
    const now = Date.now();

    // Record accumulated bytes since last sample
    this.samples.push({
      timestamp: now,
      bytesIn: this.accumulatedBytesIn,
      bytesOut: this.accumulatedBytesOut
    });

    // Reset accumulators
    this.accumulatedBytesIn = 0;
    this.accumulatedBytesOut = 0;
    this.lastSampleTime = now;

    // Trim old samples
    const cutoff = now - this.MAX_SAMPLES * 1000; // 1 hour
    this.samples = this.samples.filter(s => s.timestamp > cutoff);
  }

  // Get rate over specified window
  public getRate(windowSeconds: number): { bytesInPerSec: number; bytesOutPerSec: number } {
    // A non-positive window has no meaningful rate
    if (windowSeconds <= 0) {
      return { bytesInPerSec: 0, bytesOutPerSec: 0 };
    }

    const now = Date.now();
    const windowStart = now - (windowSeconds * 1000);

    const relevantSamples = this.samples.filter(s => s.timestamp > windowStart);

    if (relevantSamples.length === 0) {
      return { bytesInPerSec: 0, bytesOutPerSec: 0 };
    }

    const totalBytesIn = relevantSamples.reduce((sum, s) => sum + s.bytesIn, 0);
    const totalBytesOut = relevantSamples.reduce((sum, s) => sum + s.bytesOut, 0);

    // Divide by the requested window. Dividing by the time elapsed since the
    // first sample would leave out the interval that sample covers, and it
    // divides by zero when getRate() runs right after takeSample().
    return {
      bytesInPerSec: Math.round(totalBytesIn / windowSeconds),
      bytesOutPerSec: Math.round(totalBytesOut / windowSeconds)
    };
  }
}
```

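To check the windowed-rate arithmetic without depending on the wall clock, the same idea can be sketched as a pure function that takes `now` as a parameter. The names `Sample` and `rateOverWindow` are hypothetical, the sample values are invented, and dividing by the requested window length is an assumption about the intended semantics.

```typescript
// Hypothetical sample shape: bytes transferred in the interval ending at `timestamp`.
interface Sample {
  timestamp: number; // ms since epoch
  bytesIn: number;
  bytesOut: number;
}

// Sums the samples inside the window and divides by the window length.
function rateOverWindow(
  samples: Sample[],
  now: number,
  windowSeconds: number
): { bytesInPerSec: number; bytesOutPerSec: number } {
  if (windowSeconds <= 0) {
    return { bytesInPerSec: 0, bytesOutPerSec: 0 };
  }
  const windowStart = now - windowSeconds * 1000;
  let totalIn = 0;
  let totalOut = 0;
  for (const s of samples) {
    if (s.timestamp > windowStart && s.timestamp <= now) {
      totalIn += s.bytesIn;
      totalOut += s.bytesOut;
    }
  }
  return {
    bytesInPerSec: Math.round(totalIn / windowSeconds),
    bytesOutPerSec: Math.round(totalOut / windowSeconds),
  };
}

// Ten one-second samples: nine quiet seconds, then a burst in the last second.
const t0 = 1700000000000;
const demoSamples: Sample[] = [];
for (let i = 1; i <= 10; i++) {
  demoSamples.push({
    timestamp: t0 + i * 1000,
    bytesIn: i === 10 ? 5500 : 500,
    bytesOut: 250,
  });
}
const demoNow = t0 + 10000;

const instantRate = rateOverWindow(demoSamples, demoNow, 1);  // burst only
const averageRate = rateOverWindow(demoSamples, demoNow, 10); // whole history

console.log(instantRate); // { bytesInPerSec: 5500, bytesOutPerSec: 250 }
console.log(averageRate); // { bytesInPerSec: 1000, bytesOutPerSec: 250 }
```

The two windows give different answers for the same data, which is the point of keeping a time series: a short window exposes the burst, while a long window smooths it out.
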
### B. Connection-Level Byte Tracking

```typescript
// In ConnectionRecord, add:
interface IConnectionRecord {
  // ... existing fields ...

  // Byte counters with timestamps
  bytesReceivedHistory: Array<{ timestamp: number; bytes: number }>;
  bytesSentHistory: Array<{ timestamp: number; bytes: number }>;

  // For efficiency, could use circular buffer
  lastBytesReceivedUpdate: number;
  lastBytesSentUpdate: number;
}
```

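The circular buffer mentioned in the comment above could look roughly like the sketch below. `ByteEntry` and `ByteHistoryRing` are hypothetical names that do not come from SmartProxy, and the capacity and entries are illustrative only.

```typescript
// Hypothetical entry shape, matching the { timestamp, bytes } history items.
interface ByteEntry {
  timestamp: number;
  bytes: number;
}

// Fixed-capacity ring buffer: once full, each push overwrites the oldest entry.
class ByteHistoryRing {
  private readonly entries: Array<ByteEntry | undefined>;
  private readonly capacity: number;
  private head = 0; // Index where the next entry will be written
  private size = 0; // Number of valid entries currently stored

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.entries = new Array<ByteEntry | undefined>(capacity);
  }

  push(entry: ByteEntry): void {
    this.entries[this.head] = entry;
    this.head = (this.head + 1) % this.capacity;
    if (this.size < this.capacity) {
      this.size++;
    }
  }

  // Returns the stored entries ordered from oldest to newest.
  toArray(): ByteEntry[] {
    const start = (this.head - this.size + this.capacity) % this.capacity;
    const result: ByteEntry[] = [];
    for (let i = 0; i < this.size; i++) {
      const entry = this.entries[(start + i) % this.capacity];
      if (entry !== undefined) {
        result.push(entry);
      }
    }
    return result;
  }

  // Sums the bytes of all entries newer than `sinceTimestamp`.
  bytesSince(sinceTimestamp: number): number {
    let total = 0;
    for (const entry of this.toArray()) {
      if (entry.timestamp > sinceTimestamp) {
        total += entry.bytes;
      }
    }
    return total;
  }
}

const ring = new ByteHistoryRing(3);
ring.push({ timestamp: 1000, bytes: 10 });
ring.push({ timestamp: 2000, bytes: 20 });
ring.push({ timestamp: 3000, bytes: 30 });
ring.push({ timestamp: 4000, bytes: 40 }); // overwrites the entry at 1000

console.log(ring.toArray().map(e => e.timestamp)); // [ 2000, 3000, 4000 ]
console.log(ring.bytesSince(2000));                // 70
```

Memory stays bounded by the capacity regardless of how long a connection lives, at the cost of losing the oldest history first.
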
### C. Enhanced Metrics Interface

```typescript
interface IMetrics {
  // Connection metrics
  connections: {
    active(): number;
    total(): number;
    byRoute(): Map<string, number>;
    byIP(): Map<string, number>;
    topIPs(limit?: number): Array<{ ip: string; count: number }>;
  };

  // Throughput metrics (bytes per second)
  throughput: {
    instant(): { in: number; out: number };  // Last 1 second
    recent(): { in: number; out: number };   // Last 10 seconds
    average(): { in: number; out: number };  // Last 60 seconds
    custom(seconds: number): { in: number; out: number };
    history(seconds: number): Array<{ timestamp: number; in: number; out: number }>;
    byRoute(windowSeconds?: number): Map<string, { in: number; out: number }>;
    byIP(windowSeconds?: number): Map<string, { in: number; out: number }>;
  };

  // Request metrics
  requests: {
    perSecond(): number;
    perMinute(): number;
    total(): number;
  };

  // Cumulative totals
  totals: {
    bytesIn(): number;
    bytesOut(): number;
    connections(): number;
  };

  // Performance metrics
  percentiles: {
    connectionDuration(): { p50: number; p95: number; p99: number };
    bytesTransferred(): {
      in: { p50: number; p95: number; p99: number };
      out: { p50: number; p95: number; p99: number };
    };
  };
}
```

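The `percentiles` group implies a percentile calculation over recorded values. A minimal sketch using the nearest-rank method follows. The plan does not say which method SmartProxy will use, so the method, the helper names `percentile` and `summarize`, and the sample durations are all assumptions.

```typescript
// Nearest-rank percentile over an ascending-sorted array:
// the value at 1-based rank ceil(p * n / 100).
function percentile(sortedAscending: number[], p: number): number {
  const n = sortedAscending.length;
  if (n === 0) {
    return 0;
  }
  const rank = Math.ceil((p * n) / 100);
  const index = Math.min(Math.max(rank, 1), n) - 1;
  return sortedAscending[index] ?? 0;
}

// Produces the { p50, p95, p99 } shape used by the interface above.
function summarize(values: number[]): { p50: number; p95: number; p99: number } {
  const sorted = [...values].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// Hypothetical connection durations in milliseconds: 1, 2, ..., 100.
const durations: number[] = Array.from({ length: 100 }, (_, i) => i + 1);
const durationSummary = summarize(durations);

console.log(durationSummary); // { p50: 50, p95: 95, p99: 99 }
```

Sorting on every call is O(n log n), so for large histories this is the kind of calculation that the caching and debouncing ideas later in the plan would apply to.
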
## 3. Implementation Plan

### Current Status
- **Phase 1**: ~90% complete (core functionality implemented, tests need fixing)
- **Phase 2**: ~60% complete (main features done, percentiles pending)
- **Phase 3**: ~40% complete (basic optimizations in place)
- **Phase 4**: 0% complete (export formats not started)

### Phase 1: Core Throughput Tracking (Week 1)
- [x] Implement `ThroughputTracker` class
- [x] Integrate byte recording into socket data handlers
- [x] Add periodic sampling (1-second intervals)
- [x] Update `getThroughputRate()` to use time-series data (replaced with new clean API)
- [ ] Add unit tests for throughput tracking

### Phase 2: Enhanced Metrics (Week 2)
- [x] Add configurable time windows (1s, 10s, 60s, 5m, etc.)
- [ ] Implement percentile calculations
- [x] Add route-specific and IP-specific throughput tracking
- [x] Create historical data access methods
- [ ] Add integration tests

### Phase 3: Performance Optimization (Week 3)
- [x] Use circular buffers for efficiency
- [ ] Implement data aggregation for longer time windows
- [x] Add configurable retention periods
- [ ] Optimize memory usage
- [ ] Add performance benchmarks

### Phase 4: Export Formats (Week 4)
- [ ] Add Prometheus metric format with proper metric types
- [ ] Add StatsD format support
- [ ] Add JSON export with metadata
- [ ] Create OpenMetrics compatibility
- [ ] Add documentation and examples

## 4. Key Design Decisions

### A. Sampling Strategy
- **1-second samples** for fine-grained data
- **Aggregate to 1-minute** for longer retention
- **Keep 1 hour** of second-level data
- **Keep 24 hours** of minute-level data

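The aggregation step in this strategy can be sketched as follows. The function name `aggregateToMinutes`, the `Sample` shape, and the choice to key each bucket by the start of its minute are assumptions made for illustration.

```typescript
// Hypothetical sample shape: bytes transferred in one sampling interval.
interface Sample {
  timestamp: number; // ms since epoch
  bytesIn: number;
  bytesOut: number;
}

// Sums second-level samples into one bucket per minute.
// Each bucket's timestamp is the start of the minute it covers.
function aggregateToMinutes(samples: Sample[]): Sample[] {
  const buckets = new Map<number, Sample>();
  for (const s of samples) {
    const minuteStart = Math.floor(s.timestamp / 60000) * 60000;
    const bucket = buckets.get(minuteStart);
    if (bucket) {
      bucket.bytesIn += s.bytesIn;
      bucket.bytesOut += s.bytesOut;
    } else {
      buckets.set(minuteStart, {
        timestamp: minuteStart,
        bytesIn: s.bytesIn,
        bytesOut: s.bytesOut,
      });
    }
  }
  return Array.from(buckets.values()).sort((a, b) => a.timestamp - b.timestamp);
}

// Two minutes of one-second samples at 100 bytes in and 50 bytes out each.
const secondSamples: Sample[] = [];
for (let i = 0; i < 120; i++) {
  secondSamples.push({ timestamp: i * 1000, bytesIn: 100, bytesOut: 50 });
}
const minuteBuckets = aggregateToMinutes(secondSamples);

console.log(minuteBuckets.length);          // 2
console.log(minuteBuckets[0]?.bytesIn);     // 6000
console.log((minuteBuckets[1]?.bytesIn ?? 0) / 60); // 100 bytes/sec average
```

With the retention targets listed above, this means holding at most 3600 second-level samples plus 1440 minute-level buckets at any time.
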
### B. Memory Management
- **Circular buffers** for fixed memory usage
- **Configurable retention** periods
- **Lazy aggregation** for older data
- **Efficient data structures** (typed arrays for samples)

### C. Performance Considerations
- **Batch updates** during high throughput
- **Debounced calculations** for expensive metrics
- **Cached results** with TTL
- **Worker thread** option for heavy calculations

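As an illustration of the "cached results with TTL" idea, the sketch below wraps an expensive computation so that it reruns at most once per TTL period. `TtlCache` is a hypothetical helper, not an existing SmartProxy class, and the injectable clock exists only so the behaviour can be checked deterministically.

```typescript
// Hypothetical helper: recomputes a value only after its TTL has expired.
class TtlCache<T> {
  private cached: { value: T; expiresAt: number } | null = null;
  private readonly compute: () => T;
  private readonly ttlMs: number;
  private readonly clock: () => number;

  constructor(compute: () => T, ttlMs: number, clock: () => number = () => Date.now()) {
    this.compute = compute;
    this.ttlMs = ttlMs;
    this.clock = clock;
  }

  get(): T {
    const now = this.clock();
    let entry = this.cached;
    if (entry === null || now >= entry.expiresAt) {
      // Cache miss or expired entry: recompute and store with a new expiry.
      entry = { value: this.compute(), expiresAt: now + this.ttlMs };
      this.cached = entry;
    }
    return entry.value;
  }
}

// Demo with a fake clock and a counter standing in for an expensive metric.
let fakeNow = 0;
let computeCalls = 0;
const cachedMetric = new TtlCache<number>(
  () => {
    computeCalls++;
    return computeCalls * 10;
  },
  1000,
  () => fakeNow
);

const firstRead = cachedMetric.get();  // computes: 10
fakeNow = 500;
const secondRead = cachedMetric.get(); // still cached: 10
fakeNow = 1000;
const thirdRead = cachedMetric.get();  // TTL expired, recomputes: 20

console.log(firstRead, secondRead, thirdRead, computeCalls); // 10 10 20 2
```

The trade-off is staleness: a reader may see a value up to one TTL period old, which is usually acceptable for percentiles and per-route breakdowns but less so for a 1-second throughput reading.
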
## 5. Configuration Options

```typescript
interface IMetricsConfig {
  enabled: boolean;

  // Sampling configuration
  sampleIntervalMs: number;        // Default: 1000 (1 second)
  retentionSeconds: number;        // Default: 3600 (1 hour)

  // Performance tuning
  enableDetailedTracking: boolean; // Per-connection byte history
  enablePercentiles: boolean;      // Calculate percentiles
  cacheResultsMs: number;          // Cache expensive calculations

  // Export configuration
  prometheusEnabled: boolean;
  prometheusPath: string;          // Default: /metrics
  prometheusPrefix: string;        // Default: smartproxy_
}
```

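One way partial user settings could be merged with defaults is sketched below. Only the four defaults stated in the comments above come from this plan. Every other default value, along with the names `DEFAULT_METRICS_CONFIG` and `resolveMetricsConfig`, is an assumption.

```typescript
// Copied from the interface above so that this sketch is self-contained.
interface IMetricsConfig {
  enabled: boolean;
  sampleIntervalMs: number;
  retentionSeconds: number;
  enableDetailedTracking: boolean;
  enablePercentiles: boolean;
  cacheResultsMs: number;
  prometheusEnabled: boolean;
  prometheusPath: string;
  prometheusPrefix: string;
}

const DEFAULT_METRICS_CONFIG: IMetricsConfig = {
  enabled: true,                   // assumption, not stated in the plan
  sampleIntervalMs: 1000,          // documented default
  retentionSeconds: 3600,          // documented default
  enableDetailedTracking: false,   // assumption
  enablePercentiles: false,        // assumption
  cacheResultsMs: 1000,            // assumption
  prometheusEnabled: false,        // assumption
  prometheusPath: "/metrics",      // documented default
  prometheusPrefix: "smartproxy_", // documented default
};

// Hypothetical helper: applies overrides on top of the defaults and
// rejects values that would make sampling impossible.
function resolveMetricsConfig(overrides: Partial<IMetricsConfig> = {}): IMetricsConfig {
  const config: IMetricsConfig = { ...DEFAULT_METRICS_CONFIG, ...overrides };
  if (config.sampleIntervalMs <= 0) {
    throw new RangeError("sampleIntervalMs must be positive");
  }
  if (config.retentionSeconds <= 0) {
    throw new RangeError("retentionSeconds must be positive");
  }
  return config;
}

const resolvedConfig = resolveMetricsConfig({ sampleIntervalMs: 500, enableDetailedTracking: true });

// Number of samples needed to cover the retention period at this interval.
const maxSamples = Math.ceil((resolvedConfig.retentionSeconds * 1000) / resolvedConfig.sampleIntervalMs);

console.log(resolvedConfig.prometheusPath); // /metrics
console.log(maxSamples);                    // 7200
```

Deriving the sample capacity from `retentionSeconds` and `sampleIntervalMs` keeps the buffer size consistent when either setting changes.
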
## 6. Example Usage

```typescript
const proxy = new SmartProxy({
  metrics: {
    enabled: true,
    sampleIntervalMs: 1000,
    enableDetailedTracking: true
  }
});

// Get metrics instance
const metrics = proxy.getMetrics();

// Connection metrics
console.log(`Active connections: ${metrics.connections.active()}`);
console.log(`Total connections: ${metrics.connections.total()}`);

// Throughput metrics
const instant = metrics.throughput.instant();
console.log(`Current: ${instant.in} bytes/sec in, ${instant.out} bytes/sec out`);

const recent = metrics.throughput.recent();   // Last 10 seconds
const average = metrics.throughput.average(); // Last 60 seconds

// Custom time window
const custom = metrics.throughput.custom(30); // Last 30 seconds

// Historical data for graphing
const history = metrics.throughput.history(300); // Last 5 minutes
history.forEach(point => {
  console.log(`${new Date(point.timestamp)}: ${point.in} bytes/sec in, ${point.out} bytes/sec out`);
});

// Top routes by throughput
const routeThroughput = metrics.throughput.byRoute(60);
routeThroughput.forEach((stats, route) => {
  console.log(`Route ${route}: ${stats.in} bytes/sec in, ${stats.out} bytes/sec out`);
});

// Request metrics
console.log(`RPS: ${metrics.requests.perSecond()}`);
console.log(`RPM: ${metrics.requests.perMinute()}`);

// Totals
console.log(`Total bytes in: ${metrics.totals.bytesIn()}`);
console.log(`Total bytes out: ${metrics.totals.bytesOut()}`);
```

## 7. Prometheus Export Example

```
# HELP smartproxy_throughput_bytes_per_second Current throughput in bytes per second
# TYPE smartproxy_throughput_bytes_per_second gauge
smartproxy_throughput_bytes_per_second{direction="in",window="1s"} 1234567
smartproxy_throughput_bytes_per_second{direction="out",window="1s"} 987654
smartproxy_throughput_bytes_per_second{direction="in",window="10s"} 1134567
smartproxy_throughput_bytes_per_second{direction="out",window="10s"} 887654

# HELP smartproxy_bytes_total Total bytes transferred
# TYPE smartproxy_bytes_total counter
smartproxy_bytes_total{direction="in"} 123456789
smartproxy_bytes_total{direction="out"} 98765432

# HELP smartproxy_active_connections Current number of active connections
# TYPE smartproxy_active_connections gauge
smartproxy_active_connections 42

# HELP smartproxy_connection_duration_seconds Connection duration in seconds
# TYPE smartproxy_connection_duration_seconds histogram
smartproxy_connection_duration_seconds_bucket{le="0.1"} 100
smartproxy_connection_duration_seconds_bucket{le="1"} 500
smartproxy_connection_duration_seconds_bucket{le="10"} 800
smartproxy_connection_duration_seconds_bucket{le="+Inf"} 850
smartproxy_connection_duration_seconds_sum 4250
smartproxy_connection_duration_seconds_count 850
```

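Since Phase 4 has not started, the exporter itself does not exist yet. The sketch below shows one way the throughput gauge lines above might be rendered from in-memory values. `ThroughputSnapshot`, `escapeLabelValue`, and `renderThroughputGauge` are hypothetical names, and the numbers are copied from the example output.

```typescript
// Hypothetical input shape: one throughput reading per time window.
interface ThroughputSnapshot {
  window: string; // e.g. "1s", "10s"
  in: number;     // bytes per second
  out: number;    // bytes per second
}

// Escapes backslash, double quote, and newline, as required for label
// values in the Prometheus text exposition format.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders HELP and TYPE headers followed by one line per direction and window.
function renderThroughputGauge(prefix: string, snapshots: ThroughputSnapshot[]): string {
  const name = `${prefix}throughput_bytes_per_second`;
  const lines: string[] = [
    `# HELP ${name} Current throughput in bytes per second`,
    `# TYPE ${name} gauge`,
  ];
  for (const s of snapshots) {
    const window = escapeLabelValue(s.window);
    lines.push(`${name}{direction="in",window="${window}"} ${s.in}`);
    lines.push(`${name}{direction="out",window="${window}"} ${s.out}`);
  }
  return lines.join("\n") + "\n";
}

const exposition = renderThroughputGauge("smartproxy_", [
  { window: "1s", in: 1234567, out: 987654 },
  { window: "10s", in: 1134567, out: 887654 },
]);

console.log(exposition);
```

Exposing the windowed rates as gauges alongside the `smartproxy_bytes_total` counter lets a Prometheus server either read the proxy's own rates or derive them from the counter.
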
## 8. Migration Strategy

### Breaking Changes
- Completely replace the old metrics API with the new clean design
- Remove all `get*` prefixed methods in favor of grouped properties
- Use simple `{ in, out }` objects instead of verbose property names
- Provide clear migration guide in documentation

### Implementation Approach
1. Create new `ThroughputTracker` class for time-series data
2. Implement new `IMetrics` interface with clean API
3. Replace `MetricsCollector` implementation entirely
4. Update all references to use new API
5. Add comprehensive tests for accuracy validation

## 9. Success Metrics

- **Accuracy**: Throughput metrics accurate within 1% of actual
- **Performance**: < 1% CPU overhead for metrics collection
- **Memory**: < 10 MB memory usage for 1 hour of data
- **Latency**: < 1 ms to retrieve any metric
- **Reliability**: No metrics data loss under load

## 10. Future Enhancements

### Phase 5: Advanced Analytics
- Anomaly detection for traffic patterns
- Predictive analytics for capacity planning
- Correlation analysis between routes
- Real-time alerting integration

### Phase 6: Distributed Metrics
- Metrics aggregation across multiple proxies
- Distributed time-series storage
- Cross-proxy analytics
- Global dashboard support

## 11. Risks and Mitigations

### Risk: Memory Usage
- **Mitigation**: Circular buffers and configurable retention
- **Monitoring**: Track memory usage per metric type

### Risk: Performance Impact
- **Mitigation**: Efficient data structures and caching
- **Testing**: Load test with metrics enabled/disabled

### Risk: Data Accuracy
- **Mitigation**: Atomic operations and proper synchronization
- **Validation**: Compare with external monitoring tools

## Conclusion

This plan transforms SmartProxy's metrics from a basic cumulative system to a comprehensive, time-series based monitoring solution suitable for production environments. The phased approach ensures minimal disruption while delivering immediate value through accurate throughput measurements.