Run
Auth
Authorization: Bearer <API_KEY> — the same dedicated-endpoint key you use for inference. Each key gets its own rate-limit bucket, so a single Grafana agent can scrape multiple endpoints from one egress IP without contending for buckets.
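As a minimal sketch of the header above, the request can be built with Python's standard library. The metrics URL is not given in this section, so `METRICS_URL` below is a hypothetical placeholder, and `build_metrics_request` is an illustrative helper name rather than part of any Wafer SDK.

```python
import urllib.request

# Hypothetical placeholder: substitute the metrics URL for your dedicated endpoint.
METRICS_URL = "https://example.invalid/metrics"


def build_metrics_request(api_key: str, url: str = METRICS_URL) -> urllib.request.Request:
    """Build a GET request carrying the Bearer key. Nothing is sent yet."""
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Bearer {api_key}")
    return request


request = build_metrics_request("YOUR_API_KEY")
print(request.get_header("Authorization"))
```

Passing the returned object to `urllib.request.urlopen` would perform the scrape. Keep one key per endpoint so each scrape draws from its own rate-limit bucket.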
Scrape Interval & Rate Limit
- Recommended scrape interval: 60s
- Rate limit: 6 req/min per API key (HTTP 429 on overrun)
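The arithmetic behind these two numbers can be sketched as follows: a 60s interval uses 1 of the 6 requests per minute, which leaves headroom for retries or for a second scraper sharing the key. The helper names are illustrative, and the sketch assumes requests are spread evenly across the minute; how the limiter windows requests is not stated here.

```python
RATE_LIMIT_PER_MINUTE = 6  # per API key, from the limits above


def requests_per_minute(scrape_interval_s: float, scrapers_sharing_key: int = 1) -> float:
    """Steady-state request rate produced by scrapers that share one key."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return scrapers_sharing_key * 60.0 / scrape_interval_s


def fits_rate_limit(scrape_interval_s: float, scrapers_sharing_key: int = 1) -> bool:
    """True if the steady-state rate stays at or under the per-key limit."""
    return requests_per_minute(scrape_interval_s, scrapers_sharing_key) <= RATE_LIMIT_PER_MINUTE


print(fits_rate_limit(60))      # one scraper at the recommended interval
print(fits_rate_limit(60, 7))   # seven scrapers sharing one key
print(fits_rate_limit(5))       # a 5s interval is 12 requests per minute
```

A configuration that sits exactly at the limit has no room for retries, so staying well below 6 requests per minute is the safer choice.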
Available
- Dedicated endpoints only.
- Serverless (pass.wafer.ai) keys return 403, because the shared-fleet metrics aren’t meaningfully scoped to a single customer’s key. Use the Billing API for serverless usage.
Metrics
All series carry an endpoint="<your-endpoint>" label. The _last_24h suffix marks a rolling 24-hour gauge, not a true monotonic counter: divide by 86400 for an average per-second rate, or alert on absolute counts.
Request volume
- wafer_requests_last_24h{status_class}: request counts by 2xx/4xx/5xx
- wafer_input_tokens_last_24h: total input tokens billed
- wafer_output_tokens_last_24h: total output tokens billed
- wafer_cache_read_tokens_last_24h: input tokens served from prompt cache
- wafer_cache_hit_ratio_last_24h: 0.0–1.0, cache_read / input_tokens
Latency (24h-window percentile gauges)
- wafer_ttft_seconds{quantile}: 0.5/0.9/0.95/0.99. Streaming requests only.
- wafer_request_duration_seconds{quantile}: 0.5/0.9/0.99. End-to-end wall time.
- wafer_output_tokens_per_second{quantile}: 0.1/0.5/0.9/0.99. Per-request output throughput.
Live engine state
These reflect the most recent ~1 min from the worker fleet. If a worker stopped reporting more than 10 min ago, its series drops out of the scrape rather than continuing to show stale state.
- wafer_engine_running_requests: in-flight requests, summed across replicas. Equivalent to sglang’s num_running_reqs.
- wafer_engine_queue_depth: queued requests waiting for engine admission, summed across replicas.
- wafer_engine_kv_cache_usage_ratio: KV cache utilization, averaged across replicas. 0.0–1.0.
- wafer_engine_kv_cache_hit_ratio: engine-side KV cache hit rate. 0.0–1.0. Note: this is the engine’s internal cache metric and may diverge from wafer_cache_hit_ratio_last_24h, which is computed from billed token counts.
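Because a stale engine series is dropped rather than reported as zero, a consumer should distinguish "absent" from "0". The sketch below is one way to do that, assuming the scrape has already been parsed into a name-to-value mapping; `engine_snapshot` and `is_reporting` are illustrative names.

```python
from typing import Mapping, Optional

ENGINE_GAUGES = (
    "wafer_engine_running_requests",
    "wafer_engine_queue_depth",
    "wafer_engine_kv_cache_usage_ratio",
    "wafer_engine_kv_cache_hit_ratio",
)


def engine_snapshot(samples: Mapping[str, float]) -> dict[str, Optional[float]]:
    """Map each engine gauge to its value, or None if it is absent from the scrape."""
    return {name: samples.get(name) for name in ENGINE_GAUGES}


def is_reporting(snapshot: Mapping[str, Optional[float]]) -> bool:
    """True if at least one engine gauge is present. A value of 0.0 counts as present."""
    return any(value is not None for value in snapshot.values())


idle = engine_snapshot({"wafer_engine_running_requests": 0.0})
stale = engine_snapshot({})
print(is_reporting(idle), is_reporting(stale))
```

An idle endpoint still reports zeros, while a fleet that has stopped reporting yields no engine series at all, so the two cases call for different alerts.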
Activity
- wafer_concurrent_requests: distinct requests in flight in the last 2 min. Coarse “is the endpoint busy” gauge.
- wafer_metrics_window_seconds: constant 86400, exposed so dashboards can label tiles correctly if we ever change the aggregation window.
Prometheus Config
Grafana Cloud Agent
Same shape, with the key in the headers block:
OTel Collector
Use the Prometheus receiver:
Datadog Agent
Use the openmetrics check:
Why gauges instead of counters?
We expose _last_24h gauges instead of monotonic _total counters because the underlying log retention is finite. A “since-beginning-of-time” counter would silently start dropping events once they aged out, which is worse than an honestly named rolling window. For an average request rate, divide by 86400; for spike detection, alert on the absolute value relative to your baseline.
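The two uses described above can be sketched in a few lines. The function names are illustrative, and the numbers in the usage lines are made up for the example.

```python
WINDOW_SECONDS = 86400  # also exposed as wafer_metrics_window_seconds


def average_rate_per_second(count_last_24h: float, window_seconds: float = WINDOW_SECONDS) -> float:
    """Average per-second rate implied by a rolling-window count."""
    return count_last_24h / window_seconds


def ratio_to_baseline(current: float, baseline: float) -> float:
    """How many times the current 24h value exceeds a known-normal baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return current / baseline


# Made-up values: 432,000 requests in the window, against a baseline of 216,000.
print(average_rate_per_second(432_000))
print(ratio_to_baseline(432_000, 216_000))
```

Reading the window length from wafer_metrics_window_seconds instead of hardcoding 86400 keeps the rate calculation correct if the aggregation window ever changes.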
Why no histogram buckets?
The percentile series (_seconds, _per_second) are gauges with a quantile label, not Prometheus histograms. The backing query returns precomputed percentiles directly; rendering full histograms would require 8–12 additional queries per scrape and wouldn’t fit the rate-limit budget. You can chart and alert on the existing quantiles; PromQL histogram_quantile() doesn’t apply.
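Since the quantiles arrive precomputed, a consumer reads them straight off the quantile label. The sketch below parses Prometheus text-format lines with a small regex. The sample payload is invented for illustration (including the endpoint name and the values), and the parser is deliberately minimal: it does not handle escaped quotes or braces inside label values.

```python
import re

SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def quantiles(text: str, metric: str) -> dict[str, float]:
    """Return {quantile label: value} for one gauge family in a scrape body."""
    result: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        if "quantile" in labels:
            result[labels["quantile"]] = float(match.group("value"))
    return result


# Invented sample payload, for illustration only.
SAMPLE = '''\
# TYPE wafer_ttft_seconds gauge
wafer_ttft_seconds{endpoint="my-endpoint",quantile="0.5"} 0.21
wafer_ttft_seconds{endpoint="my-endpoint",quantile="0.99"} 1.8
wafer_request_duration_seconds{endpoint="my-endpoint",quantile="0.5"} 3.4
'''

print(quantiles(SAMPLE, "wafer_ttft_seconds"))
```

In PromQL the equivalent is a plain label selector on quantile, with no aggregation across buckets.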
Errors
- 401: missing or invalid API key
- 403: dedicated-only. Returned for serverless keys, or for a dedicated key used against an endpoint it doesn’t own
- 429: rate limit (6/min per key). Back off and retry
- 5xx: backing query failed. The engine-gauge query failing degrades gracefully (you’ll still get the 24h-window series); the heavy summary query failing surfaces as a 5xx so your Prometheus target shows as down rather than a misleading 200 with zeros.
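For a custom scraper, the status codes above map naturally onto a small set of actions. This is a sketch under that reading, not a prescribed client: the `ScrapeAction` enum and `action_for_status` are illustrative names, and treating any unlisted status as an error is an assumption.

```python
from enum import Enum


class ScrapeAction(Enum):
    USE_PAYLOAD = "parse the metrics body"
    FIX_CREDENTIALS = "do not retry; check the key and the endpoint it belongs to"
    BACK_OFF = "wait, then retry within the per-key rate limit"
    TARGET_DOWN = "mark the target down and retry on the next interval"


def action_for_status(status: int) -> ScrapeAction:
    """Map an HTTP status from the metrics endpoint to a scraper action."""
    if 200 <= status < 300:
        return ScrapeAction.USE_PAYLOAD
    if status in (401, 403):
        return ScrapeAction.FIX_CREDENTIALS  # retrying with the same key cannot succeed
    if status == 429:
        return ScrapeAction.BACK_OFF
    if 500 <= status < 600:
        return ScrapeAction.TARGET_DOWN
    raise ValueError(f"status {status} is not covered by the documented errors")


for code in (200, 401, 429, 503):
    print(code, action_for_status(code).name)
```

A 200 response may still omit the engine gauges when only the engine-gauge query failed, so a successful status does not guarantee that every series is present.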
Notes
- The 24h window matches the dashboard’s default range, so values in PromQL line up with what you see in the Wafer UI.
- Cardinality is intentionally bounded: endpoint (always), plus status_class (3 values) or quantile (3–4 values) on the partitioned series. No per-model or per-request labels in v1.
- For per-request debugging, use Request Inspection. For the same data as JSON, use the Metrics API.