fix(metrics): improve metrics
358
readme.plan.md
Normal file
@ -0,0 +1,358 @@
# SmartProxy Metrics Improvement Plan

## Overview

The current `getThroughputRate()` implementation calculates cumulative throughput over a 60-second window rather than providing an actual rate, making metrics misleading for monitoring systems. This plan outlines a comprehensive redesign of the metrics system to provide accurate, time-series based metrics suitable for production monitoring.

## 1. Core Issues with Current Implementation

- **Cumulative vs Rate**: Current method accumulates all bytes from connections in the last minute rather than calculating actual throughput rate
- **No Time-Series Data**: Cannot track throughput changes over time
- **Inaccurate Estimates**: Attempting to estimate rates for older connections is fundamentally flawed
- **No Sliding Windows**: Cannot provide different time window views (1s, 10s, 60s, etc.)
- **Limited Granularity**: Only provides a single 60-second view

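To make the first issue concrete, here is a minimal sketch of the difference between a cumulative total and a rate over the same window. The `Sample` shape and both function names are illustrative assumptions, not part of the existing SmartProxy code.

```typescript
// Hypothetical sample shape: bytes observed during one sampling interval.
interface Sample {
  timestamp: number; // ms since epoch, end of the interval
  bytes: number;
}

// What a cumulative approach reports: total bytes seen inside the window.
function cumulativeBytes(samples: Sample[], now: number, windowMs: number): number {
  return samples
    .filter((s) => s.timestamp > now - windowMs)
    .reduce((sum, s) => sum + s.bytes, 0);
}

// What a monitoring system expects: bytes per second over the same window.
function bytesPerSecond(samples: Sample[], now: number, windowMs: number): number {
  return cumulativeBytes(samples, now, windowMs) / (windowMs / 1000);
}
```

For a steady 1000 bytes every second, the cumulative figure over 60 seconds is 60000 while the rate is 1000 bytes/sec, which is why reporting the former as a "rate" misleads dashboards.
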
## 2. Proposed Architecture

### A. Time-Series Throughput Tracking

```typescript
interface IThroughputSample {
  timestamp: number;
  bytesIn: number;
  bytesOut: number;
}

class ThroughputTracker {
  private samples: IThroughputSample[] = [];
  private readonly SAMPLE_INTERVAL_MS = 1000; // takeSample() is expected to run once per second
  private readonly MAX_SAMPLES = 3600; // 1 hour at 1 sample/second
  private lastSampleTime: number = 0;
  private accumulatedBytesIn: number = 0;
  private accumulatedBytesOut: number = 0;

  // Called on every data transfer
  public recordBytes(bytesIn: number, bytesOut: number): void {
    this.accumulatedBytesIn += bytesIn;
    this.accumulatedBytesOut += bytesOut;
  }

  // Called periodically (every second)
  public takeSample(): void {
    const now = Date.now();

    // Record accumulated bytes since last sample
    this.samples.push({
      timestamp: now,
      bytesIn: this.accumulatedBytesIn,
      bytesOut: this.accumulatedBytesOut
    });
    this.lastSampleTime = now;

    // Reset accumulators
    this.accumulatedBytesIn = 0;
    this.accumulatedBytesOut = 0;

    // Trim samples older than the retention period (1 hour)
    const cutoff = now - this.MAX_SAMPLES * this.SAMPLE_INTERVAL_MS;
    this.samples = this.samples.filter(s => s.timestamp > cutoff);
  }

  // Get rate over specified window
  public getRate(windowSeconds: number): { bytesInPerSec: number; bytesOutPerSec: number } {
    const now = Date.now();
    const windowStart = now - (windowSeconds * 1000);

    const relevantSamples = this.samples.filter(s => s.timestamp > windowStart);

    if (relevantSamples.length === 0) {
      return { bytesInPerSec: 0, bytesOutPerSec: 0 };
    }

    const totalBytesIn = relevantSamples.reduce((sum, s) => sum + s.bytesIn, 0);
    const totalBytesOut = relevantSamples.reduce((sum, s) => sum + s.bytesOut, 0);

    // Each sample covers one sampling interval, so divide by the time the
    // samples actually span. Dividing by (now - first timestamp) would
    // overstate the rate and divide by zero right after a sample is taken.
    const actualWindow = (relevantSamples.length * this.SAMPLE_INTERVAL_MS) / 1000;

    return {
      bytesInPerSec: Math.round(totalBytesIn / actualWindow),
      bytesOutPerSec: Math.round(totalBytesOut / actualWindow)
    };
  }
}
```

### B. Connection-Level Byte Tracking

```typescript
// In ConnectionRecord, add:
interface IConnectionRecord {
  // ... existing fields ...

  // Byte counters with timestamps
  bytesReceivedHistory: Array<{ timestamp: number; bytes: number }>;
  bytesSentHistory: Array<{ timestamp: number; bytes: number }>;

  // For efficiency, could use circular buffer
  lastBytesReceivedUpdate: number;
  lastBytesSentUpdate: number;
}
```

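The circular buffer mentioned in the comment above could look roughly like the following. This is a sketch under assumptions: the class name `ByteHistoryBuffer`, its methods, and the choice of `Float64Array` storage are illustrative and not taken from the SmartProxy codebase. It also assumes entries are pushed in non-decreasing timestamp order.

```typescript
// Fixed-capacity ring buffer of (timestamp, bytes) pairs backed by typed arrays.
// Hypothetical helper; names are not part of the plan's API.
class ByteHistoryBuffer {
  private readonly capacity: number;
  private readonly timestamps: Float64Array;
  private readonly bytes: Float64Array;
  private head = 0; // index of the next write
  private size = 0; // number of valid entries

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.timestamps = new Float64Array(capacity);
    this.bytes = new Float64Array(capacity);
  }

  // Appends an entry, overwriting the oldest one once the buffer is full.
  push(timestamp: number, byteCount: number): void {
    this.timestamps[this.head] = timestamp;
    this.bytes[this.head] = byteCount;
    this.head = (this.head + 1) % this.capacity;
    if (this.size < this.capacity) this.size++;
  }

  // Sums the bytes of all retained entries newer than `sinceTimestamp`.
  // Walks backwards from the newest entry and stops at the first old one,
  // which is only valid when timestamps were pushed in order.
  sumSince(sinceTimestamp: number): number {
    let total = 0;
    for (let i = 0; i < this.size; i++) {
      const index = (this.head - 1 - i + this.capacity) % this.capacity;
      if (this.timestamps[index] <= sinceTimestamp) break;
      total += this.bytes[index];
    }
    return total;
  }

  get length(): number {
    return this.size;
  }
}
```

Memory stays constant regardless of connection lifetime, at the cost of silently dropping the oldest entries once `capacity` is reached.
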
### C. Enhanced Metrics Interface

```typescript
interface IMetrics {
  // Connection metrics
  connections: {
    active(): number;
    total(): number;
    byRoute(): Map<string, number>;
    byIP(): Map<string, number>;
    topIPs(limit?: number): Array<{ ip: string; count: number }>;
  };

  // Throughput metrics (bytes per second)
  throughput: {
    instant(): { in: number; out: number }; // Last 1 second
    recent(): { in: number; out: number }; // Last 10 seconds
    average(): { in: number; out: number }; // Last 60 seconds
    custom(seconds: number): { in: number; out: number };
    history(seconds: number): Array<{ timestamp: number; in: number; out: number }>;
    byRoute(windowSeconds?: number): Map<string, { in: number; out: number }>;
    byIP(windowSeconds?: number): Map<string, { in: number; out: number }>;
  };

  // Request metrics
  requests: {
    perSecond(): number;
    perMinute(): number;
    total(): number;
  };

  // Cumulative totals
  totals: {
    bytesIn(): number;
    bytesOut(): number;
    connections(): number;
  };

  // Performance metrics
  percentiles: {
    connectionDuration(): { p50: number; p95: number; p99: number };
    bytesTransferred(): {
      in: { p50: number; p95: number; p99: number };
      out: { p50: number; p95: number; p99: number };
    };
  };
}
```

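One plausible way to back the fixed-window `throughput` methods is to delegate them to a single `getRate(windowSeconds)` call with different windows. The sketch below assumes a `RateSource` interface matching the tracker's `getRate` signature; `RateSource`, `ThroughputRate`, and `createThroughputView` are hypothetical names introduced for illustration.

```typescript
// Anything that can report a rate for a given window, e.g. a throughput tracker.
interface RateSource {
  getRate(windowSeconds: number): { bytesInPerSec: number; bytesOutPerSec: number };
}

interface ThroughputRate {
  in: number;
  out: number;
}

// Builds the fixed-window accessors on top of one generic rate lookup.
function createThroughputView(source: RateSource) {
  const custom = (seconds: number): ThroughputRate => {
    const rate = source.getRate(seconds);
    return { in: rate.bytesInPerSec, out: rate.bytesOutPerSec };
  };
  return {
    instant: (): ThroughputRate => custom(1), // last 1 second
    recent: (): ThroughputRate => custom(10), // last 10 seconds
    average: (): ThroughputRate => custom(60), // last 60 seconds
    custom,
  };
}
```

Keeping `instant`, `recent`, and `average` as thin wrappers means any fix to the rate calculation applies to every window at once.
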
## 3. Implementation Plan

### Current Status
- **Phase 1**: ~90% complete (core functionality implemented, tests need fixing)
- **Phase 2**: ~60% complete (main features done, percentiles pending)
- **Phase 3**: ~40% complete (basic optimizations in place)
- **Phase 4**: 0% complete (export formats not started)

### Phase 1: Core Throughput Tracking (Week 1)
- [x] Implement `ThroughputTracker` class
- [x] Integrate byte recording into socket data handlers
- [x] Add periodic sampling (1-second intervals)
- [x] Update `getThroughputRate()` to use time-series data (replaced with new clean API)
- [ ] Add unit tests for throughput tracking

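For the outstanding unit-test item, one approach is to make the clock injectable so tests do not depend on real timers. The sketch below is a simplified stand-in for `ThroughputTracker`, not the actual class: the `Clock` parameter, the class name `TestableThroughputTracker`, and `simulateSteadyTraffic` are assumptions added for testability.

```typescript
type Clock = () => number;
type Rate = { bytesInPerSec: number; bytesOutPerSec: number };

// Minimal stand-in for the plan's tracker with an injected clock (assumption).
class TestableThroughputTracker {
  private readonly now: Clock;
  private readonly sampleIntervalMs: number;
  private samples: Array<{ timestamp: number; bytesIn: number; bytesOut: number }> = [];
  private pendingIn = 0;
  private pendingOut = 0;

  constructor(now: Clock, sampleIntervalMs: number = 1000) {
    this.now = now;
    this.sampleIntervalMs = sampleIntervalMs;
  }

  recordBytes(bytesIn: number, bytesOut: number): void {
    this.pendingIn += bytesIn;
    this.pendingOut += bytesOut;
  }

  takeSample(): void {
    this.samples.push({ timestamp: this.now(), bytesIn: this.pendingIn, bytesOut: this.pendingOut });
    this.pendingIn = 0;
    this.pendingOut = 0;
  }

  getRate(windowSeconds: number): Rate {
    const windowStart = this.now() - windowSeconds * 1000;
    const relevant = this.samples.filter((s) => s.timestamp > windowStart);
    if (relevant.length === 0) return { bytesInPerSec: 0, bytesOutPerSec: 0 };
    // Each sample covers one sampling interval.
    const seconds = (relevant.length * this.sampleIntervalMs) / 1000;
    const totalIn = relevant.reduce((sum, s) => sum + s.bytesIn, 0);
    const totalOut = relevant.reduce((sum, s) => sum + s.bytesOut, 0);
    return {
      bytesInPerSec: Math.round(totalIn / seconds),
      bytesOutPerSec: Math.round(totalOut / seconds),
    };
  }
}

// Drives the tracker with a fake clock: 10 seconds at 500 B/s in, 250 B/s out.
function simulateSteadyTraffic(): Rate {
  let fakeNow = 0;
  const tracker = new TestableThroughputTracker(() => fakeNow);
  for (let second = 0; second < 10; second++) {
    tracker.recordBytes(500, 250);
    fakeNow += 1000;
    tracker.takeSample();
  }
  return tracker.getRate(10);
}
```

With a fake clock the expected rate is exact, so a test can assert equality instead of tolerating timing jitter.
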
### Phase 2: Enhanced Metrics (Week 2)
- [x] Add configurable time windows (1s, 10s, 60s, 5m, etc.)
- [ ] Implement percentile calculations
- [x] Add route-specific and IP-specific throughput tracking
- [x] Create historical data access methods
- [ ] Add integration tests

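The pending percentile item could start from a simple nearest-rank calculation. This is a sketch, not the planned implementation: the plan does not say which percentile definition will be used, and the function names `percentile` and `summarize` are hypothetical.

```typescript
// Nearest-rank percentile: the smallest value such that at least p% of the
// samples are less than or equal to it. Returns NaN for an empty input.
function percentile(values: readonly number[], p: number): number {
  if (p < 0 || p > 100) throw new RangeError("p must be between 0 and 100");
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Produces the { p50, p95, p99 } shape used by the metrics interface.
function summarize(values: readonly number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(values, 50),
    p95: percentile(values, 95),
    p99: percentile(values, 99),
  };
}
```

Sorting on every call is O(n log n); for large connection counts a streaming estimator or cached result may be preferable.
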
### Phase 3: Performance Optimization (Week 3)
- [x] Use circular buffers for efficiency
- [ ] Implement data aggregation for longer time windows
- [x] Add configurable retention periods
- [ ] Optimize memory usage
- [ ] Add performance benchmarks

### Phase 4: Export Formats (Week 4)
- [ ] Add Prometheus metric format with proper metric types
- [ ] Add StatsD format support
- [ ] Add JSON export with metadata
- [ ] Create OpenMetrics compatibility
- [ ] Add documentation and examples

## 4. Key Design Decisions

### A. Sampling Strategy
- **1-second samples** for fine-grained data
- **Aggregate to 1-minute** for longer retention
- **Keep 1 hour** of second-level data
- **Keep 24 hours** of minute-level data

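The roll-up from second-level to minute-level data described above can be sketched as follows. The `ThroughputSample` shape mirrors the plan's sample interface; `aggregateToMinutes` is a hypothetical helper, and aligning buckets to wall-clock minutes is an assumption.

```typescript
interface ThroughputSample {
  timestamp: number; // ms since epoch
  bytesIn: number;
  bytesOut: number;
}

// Rolls 1-second samples up into 1-minute buckets keyed by the minute's start time.
function aggregateToMinutes(samples: readonly ThroughputSample[]): ThroughputSample[] {
  const buckets = new Map<number, ThroughputSample>();
  for (const sample of samples) {
    const minuteStart = Math.floor(sample.timestamp / 60000) * 60000;
    const bucket = buckets.get(minuteStart);
    if (bucket) {
      bucket.bytesIn += sample.bytesIn;
      bucket.bytesOut += sample.bytesOut;
    } else {
      // Create a fresh object so the input samples are never mutated.
      buckets.set(minuteStart, {
        timestamp: minuteStart,
        bytesIn: sample.bytesIn,
        bytesOut: sample.bytesOut,
      });
    }
  }
  return Array.from(buckets.values()).sort((a, b) => a.timestamp - b.timestamp);
}
```

Each minute bucket keeps byte totals rather than rates, so a rate for any multi-minute window can still be derived by dividing by the window length.
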
### B. Memory Management
- **Circular buffers** for fixed memory usage
- **Configurable retention** periods
- **Lazy aggregation** for older data
- **Efficient data structures** (typed arrays for samples)

### C. Performance Considerations
- **Batch updates** during high throughput
- **Debounced calculations** for expensive metrics
- **Cached results** with TTL
- **Worker thread** option for heavy calculations

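As an illustration of the "cached results with TTL" idea, the sketch below wraps an expensive calculation so it is recomputed at most once per TTL. The helper name `cachedWithTtl` and the injectable `now` parameter are assumptions made for this example.

```typescript
// Wraps an expensive computation so it runs at most once per `ttlMs`.
// `now` is injectable so the behaviour can be checked without real timers.
function cachedWithTtl<T>(
  compute: () => T,
  ttlMs: number,
  now: () => number = Date.now
): () => T {
  let cached: { value: T; expiresAt: number } | undefined;
  return () => {
    const current = now();
    let entry = cached;
    if (entry === undefined || current >= entry.expiresAt) {
      entry = { value: compute(), expiresAt: current + ttlMs };
      cached = entry;
    }
    return entry.value;
  };
}
```

A getter for percentiles, for example, could be wrapped with a TTL taken from `cacheResultsMs`, so that frequent scrapes do not re-sort the data each time.
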
## 5. Configuration Options

```typescript
interface IMetricsConfig {
  enabled: boolean;

  // Sampling configuration
  sampleIntervalMs: number; // Default: 1000 (1 second)
  retentionSeconds: number; // Default: 3600 (1 hour)

  // Performance tuning
  enableDetailedTracking: boolean; // Per-connection byte history
  enablePercentiles: boolean; // Calculate percentiles
  cacheResultsMs: number; // Cache expensive calculations

  // Export configuration
  prometheusEnabled: boolean;
  prometheusPath: string; // Default: /metrics
  prometheusPrefix: string; // Default: smartproxy_
}
```

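A sketch of how partial user configuration might be merged with defaults follows. Only the four defaults stated in the comments above come from the plan; every other default, along with the names `MetricsConfig`, `DEFAULT_METRICS_CONFIG`, and `resolveMetricsConfig`, is an assumption for illustration.

```typescript
interface MetricsConfig {
  enabled: boolean;
  sampleIntervalMs: number;
  retentionSeconds: number;
  enableDetailedTracking: boolean;
  enablePercentiles: boolean;
  cacheResultsMs: number;
  prometheusEnabled: boolean;
  prometheusPath: string;
  prometheusPrefix: string;
}

const DEFAULT_METRICS_CONFIG: MetricsConfig = {
  enabled: true, // assumption: not specified in the plan
  sampleIntervalMs: 1000, // from the plan
  retentionSeconds: 3600, // from the plan
  enableDetailedTracking: false, // assumption
  enablePercentiles: false, // assumption
  cacheResultsMs: 1000, // assumption
  prometheusEnabled: false, // assumption
  prometheusPath: "/metrics", // from the plan
  prometheusPrefix: "smartproxy_", // from the plan
};

// Merges user overrides with defaults and rejects obviously invalid values.
function resolveMetricsConfig(overrides: Partial<MetricsConfig> = {}): MetricsConfig {
  const config: MetricsConfig = { ...DEFAULT_METRICS_CONFIG, ...overrides };
  if (config.sampleIntervalMs <= 0) {
    throw new RangeError("sampleIntervalMs must be positive");
  }
  if (config.retentionSeconds * 1000 < config.sampleIntervalMs) {
    throw new RangeError("retentionSeconds must cover at least one sample interval");
  }
  return config;
}
```

Resolving the configuration once at startup keeps the rest of the metrics code free of `undefined` checks.
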
## 6. Example Usage

```typescript
const proxy = new SmartProxy({
  metrics: {
    enabled: true,
    sampleIntervalMs: 1000,
    enableDetailedTracking: true
  }
});

// Get metrics instance
const metrics = proxy.getMetrics();

// Connection metrics
console.log(`Active connections: ${metrics.connections.active()}`);
console.log(`Total connections: ${metrics.connections.total()}`);

// Throughput metrics
const instant = metrics.throughput.instant();
console.log(`Current: ${instant.in} bytes/sec in, ${instant.out} bytes/sec out`);

const recent = metrics.throughput.recent(); // Last 10 seconds
const average = metrics.throughput.average(); // Last 60 seconds

// Custom time window
const custom = metrics.throughput.custom(30); // Last 30 seconds

// Historical data for graphing
const history = metrics.throughput.history(300); // Last 5 minutes
history.forEach(point => {
  console.log(`${new Date(point.timestamp)}: ${point.in} bytes/sec in, ${point.out} bytes/sec out`);
});

// Top routes by throughput
const routeThroughput = metrics.throughput.byRoute(60);
routeThroughput.forEach((stats, route) => {
  console.log(`Route ${route}: ${stats.in} bytes/sec in, ${stats.out} bytes/sec out`);
});

// Request metrics
console.log(`RPS: ${metrics.requests.perSecond()}`);
console.log(`RPM: ${metrics.requests.perMinute()}`);

// Totals
console.log(`Total bytes in: ${metrics.totals.bytesIn()}`);
console.log(`Total bytes out: ${metrics.totals.bytesOut()}`);
```

## 7. Prometheus Export Example

```
# HELP smartproxy_throughput_bytes_per_second Current throughput in bytes per second
# TYPE smartproxy_throughput_bytes_per_second gauge
smartproxy_throughput_bytes_per_second{direction="in",window="1s"} 1234567
smartproxy_throughput_bytes_per_second{direction="out",window="1s"} 987654
smartproxy_throughput_bytes_per_second{direction="in",window="10s"} 1134567
smartproxy_throughput_bytes_per_second{direction="out",window="10s"} 887654

# HELP smartproxy_bytes_total Total bytes transferred
# TYPE smartproxy_bytes_total counter
smartproxy_bytes_total{direction="in"} 123456789
smartproxy_bytes_total{direction="out"} 98765432

# HELP smartproxy_active_connections Current number of active connections
# TYPE smartproxy_active_connections gauge
smartproxy_active_connections 42

# HELP smartproxy_connection_duration_seconds Connection duration in seconds
# TYPE smartproxy_connection_duration_seconds histogram
smartproxy_connection_duration_seconds_bucket{le="0.1"} 100
smartproxy_connection_duration_seconds_bucket{le="1"} 500
smartproxy_connection_duration_seconds_bucket{le="10"} 800
smartproxy_connection_duration_seconds_bucket{le="+Inf"} 850
smartproxy_connection_duration_seconds_sum 4250
smartproxy_connection_duration_seconds_count 850
```

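The throughput gauge lines above could be produced by a small formatter along these lines. Since Phase 4 has not started, this is only a sketch: `WindowRate`, `escapeLabelValue`, and `formatThroughputGauge` are hypothetical names, and only the throughput gauge is covered.

```typescript
interface WindowRate {
  window: string; // label value such as "1s" or "10s"
  in: number;
  out: number;
}

// Escapes backslashes, double quotes, and newlines as the Prometheus text
// exposition format requires for label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders the throughput gauge with one "in" and one "out" line per window.
function formatThroughputGauge(prefix: string, rates: readonly WindowRate[]): string {
  const name = `${prefix}throughput_bytes_per_second`;
  const lines: string[] = [
    `# HELP ${name} Current throughput in bytes per second`,
    `# TYPE ${name} gauge`,
  ];
  for (const rate of rates) {
    const window = escapeLabelValue(rate.window);
    lines.push(`${name}{direction="in",window="${window}"} ${rate.in}`);
    lines.push(`${name}{direction="out",window="${window}"} ${rate.out}`);
  }
  return lines.join("\n") + "\n";
}
```

Passing the prefix as a parameter lets the `prometheusPrefix` setting flow through without hard-coding `smartproxy_`.
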
## 8. Migration Strategy

### Breaking Changes
- Completely replace the old metrics API with the new clean design
- Remove all `get*` prefixed methods in favor of grouped properties
- Use simple `{ in, out }` objects instead of verbose property names
- Provide clear migration guide in documentation

### Implementation Approach
1. Create new `ThroughputTracker` class for time-series data
2. Implement new `IMetrics` interface with clean API
3. Replace `MetricsCollector` implementation entirely
4. Update all references to use new API
5. Add comprehensive tests for accuracy validation

## 9. Success Metrics

- **Accuracy**: Throughput metrics accurate within 1% of actual
- **Performance**: < 1% CPU overhead for metrics collection
- **Memory**: < 10MB memory usage for 1 hour of data
- **Latency**: < 1ms to retrieve any metric
- **Reliability**: No metrics data loss under load

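As a rough sanity check on the memory target, the raw sample payload can be estimated with simple arithmetic. The sketch assumes each sample is stored as three 64-bit floats (timestamp, bytes in, bytes out) in typed arrays; arrays of plain JavaScript objects would cost noticeably more per sample. The function name and the `trackerCount` parameter (for per-route or per-IP trackers) are illustrative.

```typescript
const BYTES_PER_FLOAT64 = 8;
const FIELDS_PER_SAMPLE = 3; // timestamp, bytesIn, bytesOut

// Estimates raw storage for the retained samples, excluding object overhead.
function estimateSampleBytes(
  retentionSeconds: number,
  sampleIntervalMs: number,
  trackerCount: number = 1
): number {
  const samplesPerTracker = Math.ceil((retentionSeconds * 1000) / sampleIntervalMs);
  return samplesPerTracker * FIELDS_PER_SAMPLE * BYTES_PER_FLOAT64 * trackerCount;
}
```

Under these assumptions, one hour of 1-second samples is 3600 × 3 × 8 = 86,400 bytes for a single tracker, and roughly 8.6 MB for 100 trackers, so the 10 MB budget mostly constrains how many routes and IPs are tracked individually.
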
## 10. Future Enhancements

### Phase 5: Advanced Analytics
- Anomaly detection for traffic patterns
- Predictive analytics for capacity planning
- Correlation analysis between routes
- Real-time alerting integration

### Phase 6: Distributed Metrics
- Metrics aggregation across multiple proxies
- Distributed time-series storage
- Cross-proxy analytics
- Global dashboard support

## 11. Risks and Mitigations

### Risk: Memory Usage
- **Mitigation**: Circular buffers and configurable retention
- **Monitoring**: Track memory usage per metric type

### Risk: Performance Impact
- **Mitigation**: Efficient data structures and caching
- **Testing**: Load test with metrics enabled/disabled

### Risk: Data Accuracy
- **Mitigation**: Atomic operations and proper synchronization
- **Validation**: Compare with external monitoring tools

## Conclusion

This plan transforms SmartProxy's metrics from a basic cumulative system to a comprehensive, time-series based monitoring solution suitable for production environments. The phased approach ensures minimal disruption while delivering immediate value through accurate throughput measurements.