Get per-account Serverless SLA metrics for traffic served through pass.wafer.ai. The response includes a summary of latency, time-to-first-token (TTFT), tokens per second, token counts, and error counts, plus a time series over the requested lookback window. Serverless metrics are scoped to the account that owns the Serverless API key. They do not expose fleet-wide backend identities, live backend gauges, or other customers’ traffic.

Run

curl -s "https://api.wafer.ai/v1/endpoints/metrics?endpoint=pass.wafer.ai&range_minutes=60&model=<MODEL_ID>" \
  -H "Authorization: Bearer <SERVERLESS_API_KEY>"
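
For callers who prefer a script to curl, the same request can be sketched in Python using only the standard library. The helper names build_metrics_url and fetch_metrics and the timeout value are illustrative assumptions, not part of the API; only the URL, the query parameters, and the Authorization header come from the command above.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.wafer.ai/v1/endpoints/metrics"


def build_metrics_url(range_minutes=60, model=None):
    """Build the metrics URL with the same query parameters as the curl call."""
    params = {"endpoint": "pass.wafer.ai", "range_minutes": range_minutes}
    if model is not None:
        # urlencode percent-encodes the "/" found in model IDs.
        params["model"] = model
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_metrics(api_key, range_minutes=60, model=None, timeout=30):
    """Send the GET request and return the decoded JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        build_metrics_url(range_minutes, model),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_metrics("<SERVERLESS_API_KEY>") should return a dictionary with the shape shown under Response Shape; a non-2xx status raises urllib.error.HTTPError.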

Parameters

  • endpoint (required): pass.wafer.ai
  • range_minutes (optional): lookback window in minutes; one of 5, 15, 30, 60, 360, 1440, 10080, 43200. Defaults to 60.
  • model (optional): filter to one resolved model, such as Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
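
Because range_minutes accepts only a fixed set of values, a client may want to check the value before sending the request rather than wait for a 422. The following is a minimal sketch of such a check; the function name normalize_range_minutes is hypothetical, and the allowed values and default are copied from the list above.

```python
ALLOWED_RANGE_MINUTES = (5, 15, 30, 60, 360, 1440, 10080, 43200)
DEFAULT_RANGE_MINUTES = 60


def normalize_range_minutes(value=None):
    """Return a range_minutes value the API is documented to accept."""
    if value is None:
        # The parameter is optional; fall back to the documented default.
        return DEFAULT_RANGE_MINUTES
    if value not in ALLOWED_RANGE_MINUTES:
        allowed = ", ".join(str(v) for v in ALLOWED_RANGE_MINUTES)
        raise ValueError(f"range_minutes must be one of: {allowed}")
    return value
```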

Response Shape

{
  "endpoint": "pass.wafer.ai",
  "range_minutes": 60,
  "queried_at": "2026-05-22T19:55:19+00:00",
  "summary": {
    "total_requests": 42,
    "rps": 0.01,
    "ttft_p50_ms": 410.0,
    "ttft_p90_ms": 1250.0,
    "ttft_p99_ms": 3200.0,
    "tps_p10": 52.0,
    "tps_p50": 92.0,
    "tps_p90": 155.0,
    "tps_p99": 180.0,
    "latency_p50_ms": 2100.0,
    "latency_p90_ms": 6400.0,
    "latency_p99_ms": 12000.0,
    "cache_hit_pct": 72.5,
    "total_input_tokens": 480000,
    "total_output_tokens": 42000,
    "total_cache_read_tokens": 348000,
    "count_2xx": 41,
    "count_4xx": 1,
    "count_5xx": 0,
    "error_rate_pct": 2.4,
    "concurrent_requests": 2,
    "active_accounts": null,
    "engine_running_requests": null,
    "engine_queue_depth": null,
    "engine_kv_cache_hit_rate_pct": null,
    "engine_kv_cache_usage_pct": null,
    "engine_preemptions_in_range": null
  },
  "backends": [],
  "timeseries": [
    {
      "time": "2026-05-22T18:55:00Z",
      "requests": 12,
      "tps_p10": 50.0,
      "tps_p50": 92.0,
      "tps_p90": 150.0,
      "tps_p99": 175.0,
      "ttft_p50_ms": 410.0,
      "ttft_p90_ms": 1200.0,
      "ttft_p99_ms": 3000.0,
      "latency_p50_ms": 2100.0,
      "latency_p90_ms": 6400.0,
      "latency_p99_ms": 12000.0,
      "error_count": 1
    }
  ]
}
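
As an illustration of reading the timeseries array, the sketch below picks the bucket with the highest latency_p99_ms and parses its timestamp. The function names are hypothetical, and skipping buckets whose latency is null is an assumption about how empty buckets might be reported, not documented behavior.

```python
from datetime import datetime


def parse_bucket_time(value):
    """Parse a bucket timestamp such as "2026-05-22T18:55:00Z"."""
    # Replace a trailing "Z" so fromisoformat also works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def slowest_bucket(metrics):
    """Return (time, latency_p99_ms) for the slowest bucket, or None if there is none."""
    buckets = [
        bucket
        for bucket in metrics.get("timeseries", [])
        if bucket.get("latency_p99_ms") is not None  # assumption: empty buckets may be null
    ]
    if not buckets:
        return None
    worst = max(buckets, key=lambda bucket: bucket["latency_p99_ms"])
    return parse_bucket_time(worst["time"]), worst["latency_p99_ms"]
```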

Key Fields

  • rps: average requests per second across the full window
  • ttft_p50_ms, ttft_p90_ms, ttft_p99_ms: streaming time-to-first-token percentiles
  • tps_p10, tps_p50, tps_p90, tps_p99: output tokens-per-second percentiles
  • latency_p50_ms, latency_p90_ms, latency_p99_ms: end-to-end latency percentiles
  • cache_hit_pct: cache-read prompt tokens (total_cache_read_tokens) divided by total input tokens (total_input_tokens), as a percentage
  • count_2xx, count_4xx, count_5xx: request count by status class
  • error_rate_pct: count_4xx plus count_5xx, divided by total_requests, as a percentage
  • concurrent_requests: recent activity count for the scoped account
  • timeseries[*]: per-bucket requests, throughput, latency, TTFT, and error count
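
The ratio fields can be cross-checked against the raw counts in the summary object. The sketch below recomputes rps, cache_hit_pct, and error_rate_pct from the definitions above; the function name is hypothetical, and any rounding applied by the server is an assumption inferred from the example response, so small differences are possible.

```python
def derived_summary_fields(summary, range_minutes):
    """Recompute ratio fields from the raw counts in a summary object."""
    total = summary["total_requests"]
    errors = summary["count_4xx"] + summary["count_5xx"]
    input_tokens = summary["total_input_tokens"]
    return {
        # Average requests per second across the full window.
        "rps": total / (range_minutes * 60),
        # Guard against division by zero for windows with no traffic.
        "cache_hit_pct": (
            100 * summary["total_cache_read_tokens"] / input_tokens
            if input_tokens
            else None
        ),
        "error_rate_pct": 100 * errors / total if total else None,
    }
```

With the example response (42 requests over 60 minutes, 348000 cache-read tokens out of 480000 input tokens, one 4xx), this gives roughly 0.0117 requests per second, 72.5 percent cache hits, and a 2.38 percent error rate, which agree with the reported 0.01, 72.5, and 2.4 after rounding.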

Serverless Scoping

  • Only Serverless API keys can query endpoint=pass.wafer.ai.
  • Results are scoped to the owning Serverless account for the bearer key.
  • model filters results to requests served by the given resolved model.
  • backends is empty for Serverless because backend identities are fleet-level data.
  • Engine gauge fields and active_accounts are null for Serverless because they are fleet-level metrics.
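
Since the fleet-level fields are always null for Serverless, a client can drop them before storing or displaying a summary. This is a minimal sketch; the function name is hypothetical, and the field list is copied from the example response rather than from a published schema, so it may be incomplete.

```python
# Fields shown as null in the example Serverless response (assumed complete).
FLEET_LEVEL_FIELDS = (
    "active_accounts",
    "engine_running_requests",
    "engine_queue_depth",
    "engine_kv_cache_hit_rate_pct",
    "engine_kv_cache_usage_pct",
    "engine_preemptions_in_range",
)


def account_scoped_summary(summary):
    """Return a copy of the summary without the fleet-level fields."""
    return {
        key: value
        for key, value in summary.items()
        if key not in FLEET_LEVEL_FIELDS
    }
```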

Errors

  • 401: missing or invalid API key
  • 403: the API key is not a Serverless key for pass.wafer.ai
  • 422: invalid range_minutes, endpoint, or model
  • 502: upstream metrics query failed
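
A client can map these status codes to readable messages and decide whether a retry makes sense. The sketch below is one way to do that; the function names are hypothetical, and treating only 502 as retryable is an assumption based on its description as an upstream failure, not a documented retry policy.

```python
ERROR_MEANINGS = {
    401: "missing or invalid API key",
    403: "the API key is not a Serverless key for pass.wafer.ai",
    422: "invalid range_minutes, endpoint, or model",
    502: "upstream metrics query failed",
}

# Assumption: only the upstream failure is worth retrying; the other
# statuses describe problems with the request or the key itself.
RETRYABLE_STATUSES = frozenset({502})


def describe_status(status):
    """Return a readable message for a documented error status."""
    meaning = ERROR_MEANINGS.get(status, "undocumented status")
    return f"{status}: {meaning}"


def is_retryable(status):
    """Report whether retrying the same request might succeed."""
    return status in RETRYABLE_STATUSES
```

When using urllib.request, the status is available as the code attribute of the raised urllib.error.HTTPError and can be passed to either function.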