Get endpoint-level summary metrics plus a time series over a lookback window.

Run

curl -s "https://api.wafer.ai/v1/endpoints/metrics?endpoint=<ENDPOINT_HOST>&range_minutes=<RANGE_MINUTES>" \
  -H "Authorization: Bearer <API_KEY>"

Parameters

  • endpoint (required): the endpoint host to query (<ENDPOINT_HOST>)
  • range_minutes (optional, default 60): one of 5, 15, 30, 60, 360, 1440, 10080, 43200
  • model (optional): filter to one model_resolved value (case-sensitive)
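
As a minimal sketch of how these parameters combine into a request URL, the Python helper below validates range_minutes against the allowed values and URL-encodes the query string. The function name build_metrics_url is hypothetical and not part of any Wafer SDK; the base URL is the one from the curl example.

```python
from urllib.parse import urlencode

# Base URL copied from the curl example above.
BASE_URL = "https://api.wafer.ai/v1/endpoints/metrics"
# Allowed lookback windows, from the Parameters list.
ALLOWED_RANGES = (5, 15, 30, 60, 360, 1440, 10080, 43200)


def build_metrics_url(endpoint, range_minutes=60, model=None):
    """Hypothetical helper: build the metrics query URL."""
    if not endpoint:
        raise ValueError("endpoint is required")
    if range_minutes not in ALLOWED_RANGES:
        raise ValueError(f"range_minutes must be one of {ALLOWED_RANGES}")
    params = {"endpoint": endpoint, "range_minutes": range_minutes}
    if model is not None:
        # Case-sensitive: pass the resolved model ID unchanged.
        params["model"] = model
    return f"{BASE_URL}?{urlencode(params)}"


print(build_metrics_url("example-endpoint.wafer.ai", 360))
```

Validating on the client side avoids a round trip that would otherwise end in a 422.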

Response Shape

{
  "endpoint": "example-endpoint.wafer.ai",
  "range_minutes": 60,
  "queried_at": "2026-05-01T15:55:19+00:00",
  "summary": {
    "total_requests": 1200,
    "rps": 0.33,
    "ttft_p50_ms": 410.0,
    "ttft_p90_ms": 1250.0,
    "ttft_p95_ms": 2100.0,
    "ttft_p99_ms": 3200.0,
    "tps_p10": 48.0,
    "tps_p50": 92.0,
    "tps_p90": 155.0,
    "tps_p99": 180.0,
    "latency_p50_ms": 2100.0,
    "latency_p90_ms": 6400.0,
    "latency_p99_ms": 12000.0,
    "cache_hit_pct": 72.5,
    "total_input_tokens": 4800000,
    "total_output_tokens": 420000,
    "total_cache_read_tokens": 3480000,
    "count_2xx": 1184,
    "count_4xx": 16,
    "count_5xx": 0,
    "error_rate_pct": 1.3,
    "concurrent_requests": 12,
    "active_accounts": 1,
    "engine_running_requests": 2.0,
    "engine_queue_depth": 0.0,
    "engine_kv_cache_hit_rate_pct": 68.0,
    "engine_kv_cache_usage_pct": 42.0,
    "engine_preemptions_in_range": 0.0
  },
  "backends": [
    {
      "backend_id": "backend-a",
      "active_requests": null,
      "engine_running_requests": 2.0,
      "engine_queue_depth": 1.0,
      "is_healthy": true
    }
  ],
  "timeseries": [
    {
      "time": "2026-05-01T14:55:00Z",
      "requests": 120,
      "tps_p10": 48.0,
      "tps_p50": 92.0,
      "tps_p90": 155.0,
      "tps_p99": 180.0,
      "ttft_p50_ms": 410.0,
      "ttft_p90_ms": 1250.0,
      "ttft_p95_ms": 2100.0,
      "ttft_p99_ms": 3200.0,
      "latency_p50_ms": 2100.0,
      "latency_p90_ms": 6400.0,
      "latency_p99_ms": 12000.0,
      "input_tokens": 480000,
      "cache_read_tokens": 348000,
      "error_count": 1
    }
  ]
}
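
The per-bucket input_tokens and cache_read_tokens values in the response above are enough to build a "cached vs total input" series. The Python sketch below shows one way to do that, using a bucket trimmed from the sample response; the function name cached_vs_total is hypothetical, and the zero-token guard is an assumption about how empty buckets should be handled.

```python
import json

# One timeseries bucket, trimmed from the sample response above.
SAMPLE = json.loads("""
{"timeseries": [
  {"time": "2026-05-01T14:55:00Z", "requests": 120,
   "input_tokens": 480000, "cache_read_tokens": 348000, "error_count": 1}
]}
""")


def cached_vs_total(timeseries):
    """Return (time, cached, uncached, cached_pct) for each bucket."""
    rows = []
    for bucket in timeseries:
        total = bucket["input_tokens"]
        cached = bucket["cache_read_tokens"]
        # Assumption: a bucket with no input tokens has no meaningful percentage.
        pct = round(cached / total * 100, 1) if total else None
        rows.append((bucket["time"], cached, total - cached, pct))
    return rows


for row in cached_vs_total(SAMPLE["timeseries"]):
    print(row)
```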

Key Fields

  • rps: average requests per second across the full window
  • ttft_p50_ms, ttft_p90_ms, ttft_p95_ms, ttft_p99_ms: time-to-first-token (TTFT) percentiles for streaming requests (null when the window contains no streaming requests)
  • tps_p10, tps_p50, tps_p90, tps_p99: output tokens-per-second percentiles
  • latency_p50_ms, latency_p90_ms, latency_p99_ms: end-to-end latency percentiles
  • cache_hit_pct: total_cache_read_tokens / total_input_tokens * 100, rounded to 1 decimal
  • total_input_tokens, total_output_tokens, total_cache_read_tokens: raw token sums across the window
  • count_2xx, count_4xx, count_5xx: request count by status class
  • error_rate_pct: (count_4xx + count_5xx) / total_requests * 100
  • concurrent_requests: recent activity count, not an exact inflight gauge
  • active_accounts: distinct API keys active in the last 2 minutes
  • engine_running_requests: sum of in-flight requests across all backend engines
  • engine_queue_depth: sum of queued requests across all backend engines
  • engine_kv_cache_hit_rate_pct: average KV cache hit rate across backends (percentage)
  • engine_kv_cache_usage_pct: average KV cache utilization across backends (percentage)
  • engine_preemptions_in_range: total engine preemptions during the lookback window
  • backends[*].active_requests: live in-flight request gauge for the wafer-edge backend; null when the endpoint does not emit it
  • backends[*].engine_running_requests: live engine running request gauge for that backend/node
  • backends[*].engine_queue_depth: live engine queue depth gauge for that backend/node
  • timeseries[*]: per-bucket requests, full TPS / TTFT / latency percentile set, input_tokens, cache_read_tokens, and error_count. Bin size is derived from range_minutes (e.g. 1m for 5-min windows, 30m for 24-hour windows). The two token sums let you render a “cached vs total input” overlay without a second query.
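
The derived summary fields can be recomputed from the raw counts as a sanity check. The Python sketch below applies the formulas from the list above to the sample summary values. The function name derive_summary is hypothetical, and rounding rps to two decimals and error_rate_pct to one decimal is an assumption chosen to match the sample response, not documented behavior.

```python
def derive_summary(summary, range_minutes):
    """Recompute rps, cache_hit_pct, and error_rate_pct from raw counts."""
    total = summary["total_requests"]
    errors = summary["count_4xx"] + summary["count_5xx"]
    input_tokens = summary["total_input_tokens"]
    cache_reads = summary["total_cache_read_tokens"]
    return {
        # Average requests per second across the full window.
        "rps": round(total / (range_minutes * 60), 2),
        # Share of input tokens served from cache, to 1 decimal.
        "cache_hit_pct": round(cache_reads / input_tokens * 100, 1) if input_tokens else None,
        # Share of requests that ended in a 4xx or 5xx status.
        "error_rate_pct": round(errors / total * 100, 1) if total else None,
    }


sample = {
    "total_requests": 1200,
    "count_4xx": 16,
    "count_5xx": 0,
    "total_input_tokens": 4800000,
    "total_cache_read_tokens": 3480000,
}
print(derive_summary(sample, range_minutes=60))
```

With the sample values this reproduces the rps, cache_hit_pct, and error_rate_pct shown in the response.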

Timeseries Bin Sizes

The timeseries bin width is derived from range_minutes:

range_minutes    Bin size
5, 15            1 minute
30               2 minutes
60               5 minutes
360 (6h)         15 minutes
1440 (24h)       30 minutes
10080 (7d)       3 hours
43200 (30d)      12 hours

Pick the smallest range_minutes that covers your window; longer ranges return coarser bins.
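
The lookup and the "smallest range that covers your window" rule can be sketched in Python as follows. Both function names are hypothetical. The bucket count is simply range divided by bin width; the actual number of buckets returned may differ at the window edges, which this page does not specify.

```python
# range_minutes -> bin width in minutes, from the bin-size table above.
BIN_MINUTES = {
    5: 1, 15: 1, 30: 2, 60: 5,
    360: 15, 1440: 30, 10080: 180, 43200: 720,
}


def smallest_range_covering(window_minutes):
    """Pick the smallest allowed range_minutes that covers the window."""
    for range_minutes in sorted(BIN_MINUTES):
        if range_minutes >= window_minutes:
            return range_minutes
    raise ValueError("window exceeds the 30-day maximum range")


def nominal_buckets(range_minutes):
    """Nominal bucket count: range divided by bin width."""
    return range_minutes // BIN_MINUTES[range_minutes]


chosen = smallest_range_covering(45)
print(chosen, BIN_MINUTES[chosen], nominal_buckets(chosen))
```

For a 45-minute window this selects range_minutes=60, which uses 5-minute bins.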

Errors

  • 401: missing or invalid API key
  • 403: API key does not have access to the requested endpoint
  • 422: invalid range_minutes or invalid query shape
  • 502: upstream metrics query failed
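
One way a client might act on these status codes is sketched below in Python. The meanings are copied from the list above. Treating only 502 as retryable is an assumption made for this example, not documented behavior, and classify_status is a hypothetical name.

```python
# Status codes and meanings, copied from the Errors list above.
ERROR_MEANINGS = {
    401: "missing or invalid API key",
    403: "API key does not have access to the requested endpoint",
    422: "invalid range_minutes or invalid query shape",
    502: "upstream metrics query failed",
}

# Assumption: only the upstream failure is worth retrying; the other
# errors need a different key or a corrected query.
RETRYABLE = {502}


def classify_status(status):
    """Return (message, should_retry) for an HTTP status code."""
    if 200 <= status < 300:
        return ("ok", False)
    message = ERROR_MEANINGS.get(status, "unexpected status")
    return (message, status in RETRYABLE)


for code in (200, 401, 422, 502):
    print(code, classify_status(code))
```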

Drill Into a Spike

When an aggregate metric looks bad (high ttft_p95_ms / ttft_p99_ms, low cache_hit_pct, etc.), the next step is to pull the specific requests behind it. The matching recipes live in Request Inspection → Common Workflows:
  • Tail-latency: ttft_p95_ms elevated → GET /requests?min_ttft_ms=… → GET /requests/{id}.
  • Low-cache-hit large requests: cache_hit_pct looks healthy but you want the specific large requests that missed → GET /requests?max_cache_hit_pct=5&min_input_tokens=20000.
  • Streaming health: is_streaming=true with ttft_ms=null means the stream broke before the first token; use errors_only=true.
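
The first two workflows can be turned into query strings programmatically. The Python sketch below builds the relative paths only, since the host and version prefix for /requests are not given on this page. Using the window's ttft_p95_ms as the min_ttft_ms threshold, and sending it as an integer, are assumptions for illustration; both function names are hypothetical.

```python
from urllib.parse import urlencode


def tail_latency_query(summary):
    """Build a /requests query for requests at or above the window's TTFT p95."""
    threshold = summary.get("ttft_p95_ms")
    if threshold is None:
        # TTFT percentiles are null when the window has no streaming requests.
        return None
    # Assumption: the p95 value is a reasonable min_ttft_ms cutoff.
    return "/requests?" + urlencode({"min_ttft_ms": int(threshold)})


def low_cache_hit_query(max_cache_hit_pct=5, min_input_tokens=20000):
    """Build a /requests query for large requests that mostly missed the cache."""
    params = {
        "max_cache_hit_pct": max_cache_hit_pct,
        "min_input_tokens": min_input_tokens,
    }
    return "/requests?" + urlencode(params)


print(tail_latency_query({"ttft_p95_ms": 2100.0}))
print(low_cache_hit_query())
```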

Notes

  • Metrics are endpoint-scoped. You must pass an endpoint your key can access.
  • The model filter is case-sensitive and matches against the resolved model ID (the one a backend actually served), not the requested model alias.
  • For per-request debugging, use Request Inspection.