Run
Parameters
- `endpoint` (required): `<ENDPOINT_HOST>`
- `range_minutes` (optional): `5`, `15`, `30`, `60`, `360`, `1440`, `10080`, `43200`
- `model` (optional): filter to one `model_resolved` value (case-sensitive)
- default `range_minutes`: `60`
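As a rough sketch of how these parameters combine, the helper below validates `range_minutes` against the allowed values and builds a query string. The function name `build_metrics_query` and the host `host.example` are hypothetical; the request path and authentication are not covered here.

```python
from urllib.parse import urlencode

# Allowed lookback windows, in minutes, as listed in the parameters above.
ALLOWED_RANGE_MINUTES = (5, 15, 30, 60, 360, 1440, 10080, 43200)
DEFAULT_RANGE_MINUTES = 60


def build_metrics_query(endpoint, range_minutes=DEFAULT_RANGE_MINUTES, model=None):
    """Return a URL-encoded query string (hypothetical helper, not part of the API)."""
    if not endpoint:
        raise ValueError("endpoint is required")
    if range_minutes not in ALLOWED_RANGE_MINUTES:
        raise ValueError(f"range_minutes must be one of {ALLOWED_RANGE_MINUTES}")
    params = {"endpoint": endpoint, "range_minutes": range_minutes}
    if model is not None:
        # The model filter is case-sensitive, so the value is passed through unchanged.
        params["model"] = model
    return urlencode(params)
```

Validating on the client side avoids a round trip that would otherwise end in a `422`.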
Response Shape
Key Fields
- `rps`: average requests per second across the full window
- `ttft_p50_ms`, `ttft_p90_ms`, `ttft_p95_ms`, `ttft_p99_ms`: streaming TTFT percentiles (`null` on non-streaming windows)
- `tps_p10`, `tps_p50`, `tps_p90`, `tps_p99`: output tokens-per-second percentiles
- `latency_p50_ms`, `latency_p90_ms`, `latency_p99_ms`: end-to-end latency percentiles
- `cache_hit_pct`: `total_cache_read_tokens / total_input_tokens * 100`, rounded to 1 decimal
- `total_input_tokens`, `total_output_tokens`, `total_cache_read_tokens`: raw token sums across the window
- `count_2xx`, `count_4xx`, `count_5xx`: request count by status class
- `error_rate_pct`: `4xx + 5xx` divided by total requests
- `concurrent_requests`: recent activity count, not an exact inflight gauge
- `active_accounts`: distinct API keys active in the last 2 minutes
- `engine_running_requests`: sum of in-flight requests across all backend engines
- `engine_queue_depth`: sum of queued requests across all backend engines
- `engine_kv_cache_hit_rate_pct`: average KV cache hit rate across backends (percentage)
- `engine_kv_cache_usage_pct`: average KV cache utilization across backends (percentage)
- `engine_preemptions_in_range`: total engine preemptions during the lookback window
- `backends[*].active_requests`: live wafer-edge backend inflight gauge when emitted by the endpoint
- `backends[*].engine_running_requests`: live engine running request gauge for that backend/node
- `backends[*].engine_queue_depth`: live engine queue depth gauge for that backend/node
- `timeseries[*]`: per-bucket `requests`, full TPS / TTFT / latency percentile set, `input_tokens`, `cache_read_tokens`, and `error_count`. Bin size is derived from `range_minutes` (e.g. `1m` for 5-min windows, `30m` for 24-hour windows). The two token sums let you render a "cached vs total input" overlay without a second query.
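The two derived percentages can be reproduced from the raw counters, which is handy for sanity-checking a response. The sketch below follows the `cache_hit_pct` formula as stated. For `error_rate_pct` it assumes that "total requests" is the sum of the three status-class counts and that the result is scaled to a percentage; both are assumptions, as is returning `None` for an empty window.

```python
def cache_hit_pct(total_cache_read_tokens, total_input_tokens):
    """total_cache_read_tokens / total_input_tokens * 100, rounded to 1 decimal."""
    if total_input_tokens == 0:
        return None  # assumption: undefined when the window has no input tokens
    return round(total_cache_read_tokens / total_input_tokens * 100, 1)


def error_rate_pct(count_2xx, count_4xx, count_5xx):
    """(4xx + 5xx) over total requests, as a percentage (unrounded)."""
    # Assumption: total requests = 2xx + 4xx + 5xx.
    total = count_2xx + count_4xx + count_5xx
    if total == 0:
        return None  # assumption: undefined when the window has no requests
    return (count_4xx + count_5xx) * 100 / total
```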
Timeseries Bin Sizes
The `timeseries` bin width is derived from `range_minutes`:

| `range_minutes` | Bin size |
|---|---|
| 5, 15 | 1 minute |
| 30 | 2 minutes |
| 60 | 5 minutes |
| 360 (6h) | 15 minutes |
| 1440 (24h) | 30 minutes |
| 10080 (7d) | 3 hours |
| 43200 (30d) | 12 hours |
Choose the smallest `range_minutes` that covers your window; longer ranges return coarser bins.
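The bin-size mapping can be captured as a small lookup, sketched below. The helper names are hypothetical, and `nominal_bucket_count` is only the arithmetic quotient of range over bin width; the number of buckets an actual response contains may differ, for example at partial-bucket boundaries.

```python
# range_minutes -> bin width in minutes, transcribed from the bin-size table.
BIN_MINUTES = {
    5: 1, 15: 1, 30: 2, 60: 5,
    360: 15, 1440: 30, 10080: 180, 43200: 720,
}


def nominal_bucket_count(range_minutes):
    """Range divided by bin width; an estimate, not a guarantee of the response size."""
    return range_minutes // BIN_MINUTES[range_minutes]


def smallest_covering_range(window_minutes):
    """Pick the smallest allowed range_minutes that spans the requested window."""
    for candidate in sorted(BIN_MINUTES):
        if candidate >= window_minutes:
            return candidate
    raise ValueError("window exceeds the largest supported range (43200 minutes)")
```

For instance, a 45-minute incident window maps to `range_minutes=60`, which is rendered in 5-minute bins.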
Errors
- `401`: missing or invalid API key
- `403`: API key does not have access to the requested endpoint
- `422`: invalid `range_minutes` or invalid query shape
- `502`: upstream metrics query failed
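A client might map these status codes to a short explanation and a retry hint, roughly as follows. The meanings are taken from the list above; the `retry` flags and the helper name `explain_error` are assumptions for illustration, not documented behavior.

```python
# status -> (meaning from the error list, whether a retry might help).
# The retry flags are assumptions: only the upstream failure is treated as transient.
ERROR_HINTS = {
    401: ("missing or invalid API key", False),
    403: ("API key does not have access to the requested endpoint", False),
    422: ("invalid range_minutes or invalid query shape", False),
    502: ("upstream metrics query failed", True),
}


def explain_error(status):
    """Return a small dict describing a non-success status code."""
    meaning, retry = ERROR_HINTS.get(status, ("unexpected status", False))
    return {"status": status, "meaning": meaning, "retry": retry}
```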
Drill Into a Spike
When an aggregate metric looks bad (high `ttft_p95_ms` / `ttft_p99_ms`, low `cache_hit_pct`, etc.), the next step is to pull the specific requests behind it. The matching recipes live in Request Inspection → Common Workflows:
- Tail latency: `ttft_p95_ms` elevated → `GET /requests?min_ttft_ms=…` → `GET /requests/{id}`.
- Low-cache-hit large requests: `cache_hit_pct` looks healthy but you want the specific large requests that missed → `GET /requests?max_cache_hit_pct=5&min_input_tokens=20000`.
- Streaming health: `is_streaming=true` with `ttft_ms=null` means the stream broke before the first token; use `errors_only=true`.
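The drill-down paths above can be assembled with a small helper, sketched here. `requests_path` is a hypothetical name, and the `min_ttft_ms` threshold in the usage note is an arbitrary illustration, since the recipe leaves that value open.

```python
from urllib.parse import urlencode


def requests_path(**filters):
    """Build a /requests path from keyword filters, skipping any that are None."""
    params = {}
    for name, value in filters.items():
        if value is None:
            continue
        if isinstance(value, bool):
            # Emit lowercase true/false, matching the query style used in the recipes.
            value = "true" if value else "false"
        params[name] = value
    return "/requests?" + urlencode(params) if params else "/requests"
```

For example, `requests_path(max_cache_hit_pct=5, min_input_tokens=20000)` reproduces the low-cache-hit recipe, and `requests_path(min_ttft_ms=2000)` lists requests above an arbitrary 2-second TTFT threshold.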
Notes
- Metrics are endpoint-scoped. You must pass an `endpoint` your key can access.
- The `model` filter is case-sensitive and matches against the resolved model ID (the one a backend actually served), not the requested model alias.
- For per-request debugging, use Request Inspection.