Wafer exposes a per-endpoint Prometheus scrape URL so you can wire your existing Grafana / Datadog / New Relic / OTel Collector setup to a dedicated endpoint without going through the JSON Metrics API.

Run

curl -s "https://api.wafer.ai/v1/endpoints/metrics/prometheus?endpoint=<ENDPOINT_HOST>" \
  -H "Authorization: Bearer <API_KEY>"
Response is the standard Prometheus 0.0.4 text-exposition format:
# HELP wafer_requests_last_24h Requests served in the last 24h, partitioned by HTTP status class. Rolling window — divide by 86400 for an average rate, or alert on absolute counts.
# TYPE wafer_requests_last_24h gauge
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="2xx"} 12000
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="4xx"} 300
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="5xx"} 45
# HELP wafer_ttft_seconds Time to first token, in seconds. Reported as quantiles (0.5 / 0.9 / 0.95 / 0.99) over the last 24h. Streaming requests only.
# TYPE wafer_ttft_seconds gauge
wafer_ttft_seconds{endpoint="example-endpoint.wafer.ai",quantile="0.5"} 0.12
wafer_ttft_seconds{endpoint="example-endpoint.wafer.ai",quantile="0.9"} 0.25
...
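If you want to consume the response outside of a Prometheus-compatible scraper, the text format above is simple enough to parse by hand. The following is a minimal sketch using only the Python standard library; `parse_exposition` is a hypothetical helper name, and the parser deliberately ignores timestamps and does not unescape label values, so treat it as an illustration rather than a complete implementation of the format.

```python
import re
from typing import Dict, List, Tuple

# A sample line is: metric name, optional {labels}, then a numeric value.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_exposition(text: str) -> List[Sample]:
    """Parse sample lines, skipping blank lines and # HELP / # TYPE comments."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a sample line; ignore it
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Lines copied from the example response above.
body = '''# TYPE wafer_requests_last_24h gauge
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="2xx"} 12000
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="5xx"} 45
wafer_ttft_seconds{endpoint="example-endpoint.wafer.ai",quantile="0.5"} 0.12
'''
samples = parse_exposition(body)
```

Each entry in `samples` is a `(name, labels, value)` tuple, which is enough to look up a series by metric name and label.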

Auth

Authorization: Bearer <API_KEY> — the same dedicated-endpoint key you use for inference. Each key gets its own rate-limit bucket, so a single Grafana agent can scrape multiple endpoints from one egress IP without contending for buckets.
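For scripts that call the scrape URL directly, the request is just the URL from the curl example plus the Bearer header. A minimal sketch with the Python standard library follows; `build_scrape_request` is a hypothetical helper name, and the endpoint host and key are placeholders.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.wafer.ai/v1/endpoints/metrics/prometheus"


def build_scrape_request(endpoint_host: str, api_key: str) -> urllib.request.Request:
    """Build the GET request: endpoint as a query parameter, key as a Bearer token."""
    query = urllib.parse.urlencode({"endpoint": endpoint_host})
    return urllib.request.Request(
        f"{BASE_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


# Placeholder values; substitute your own endpoint host and dedicated-endpoint key.
req = build_scrape_request("example-endpoint.wafer.ai", "<API_KEY>")

# To send it (performs a real network call):
# text = urllib.request.urlopen(req, timeout=10).read().decode("utf-8")
```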

Scrape Interval & Rate Limit

  • Recommended scrape interval: 60s
  • Rate limit: 6 req/min per API key (HTTP 429 on overrun)
The backing data refreshes once per minute, so scraping more often than every 60s won’t surface new data; it just burns through your rate budget.
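The rate limit also bounds how many independent scrapers (for example, redundant Prometheus replicas) can share one key. The arithmetic can be sketched as below, assuming the limit is a plain per-minute request count shared by every scraper that uses the key; `max_scrapers_per_key` is a hypothetical helper, not part of any Wafer tooling.

```python
RATE_LIMIT_PER_MIN = 6  # documented limit: 6 req/min per API key


def max_scrapers_per_key(scrape_interval_s: float) -> int:
    """How many scrapers at this interval fit inside one key's per-minute budget."""
    scrapes_per_min_each = 60.0 / scrape_interval_s
    return int(RATE_LIMIT_PER_MIN // scrapes_per_min_each)


at_recommended = max_scrapers_per_key(60)  # one scrape per minute each
at_ten_seconds = max_scrapers_per_key(10)  # six scrapes per minute each
```

At the recommended 60s interval each scraper uses one request per minute, so there is headroom left in the budget; a single scraper at 10s consumes all of it.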

Available

  • Dedicated endpoints only.
  • Serverless (pass.wafer.ai) keys return 403 — the shared-fleet metrics aren’t meaningfully scoped to a single customer’s key. Use the Billing API for serverless usage.

Metrics

All series carry an endpoint="<your-endpoint>" label. The _last_24h suffix marks a 24h-rolling gauge (not a true monotonic counter; divide by 86400 for an average rate, or alert on absolute counts).

Request volume

  • wafer_requests_last_24h{status_class} — counts by 2xx / 4xx / 5xx
  • wafer_input_tokens_last_24h — total input tokens billed
  • wafer_output_tokens_last_24h — total output tokens billed
  • wafer_cache_read_tokens_last_24h — input tokens served from prompt cache
  • wafer_cache_hit_ratio_last_24h — 0.0–1.0, cache_read / input_tokens
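As a worked example of the arithmetic these gauges are meant for, the sketch below derives an average request rate and a 5xx ratio from the request counts in the example response, and a cache hit ratio from cache_read / input_tokens. The token counts are made-up illustration values, not real output.

```python
WINDOW_SECONDS = 86400  # the 24h rolling window

# Request counts from the example response, keyed by status_class.
requests_24h = {"2xx": 12000, "4xx": 300, "5xx": 45}

total_requests = sum(requests_24h.values())             # 12345
avg_requests_per_second = total_requests / WINDOW_SECONDS  # about 0.143 req/s
error_ratio_5xx = requests_24h["5xx"] / total_requests     # about 0.36%

# Hypothetical token counts, for illustration only.
input_tokens_24h = 8_000_000
cache_read_tokens_24h = 2_000_000

# cache_read / input_tokens, guarding against an idle endpoint with zero input.
cache_hit_ratio = (
    cache_read_tokens_24h / input_tokens_24h if input_tokens_24h else 0.0
)
```

Because the window is rolling, `avg_requests_per_second` is an average over the whole 24h and will lag behind short bursts.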

Latency (24h-window percentile gauges)

  • wafer_ttft_seconds{quantile} — 0.5 / 0.9 / 0.95 / 0.99. Streaming requests only.
  • wafer_request_duration_seconds{quantile} — 0.5 / 0.9 / 0.99. End-to-end wall time.
  • wafer_output_tokens_per_second{quantile} — 0.1 / 0.5 / 0.9 / 0.99. Per-request output throughput.

Live engine state

These reflect the most recent ~1 min from the worker fleet. If a worker stopped reporting more than 10 min ago, the series drops out of the scrape (rather than continuing to show stale state).
  • wafer_engine_running_requests — in-flight requests, summed across replicas. Equivalent to sglang’s num_running_reqs.
  • wafer_engine_queue_depth — queued requests waiting for engine admission, summed across replicas.
  • wafer_engine_kv_cache_usage_ratio — KV cache utilization, averaged across replicas. 0.0–1.0.
  • wafer_engine_kv_cache_hit_ratio — engine-side KV cache hit rate. 0.0–1.0. Note: this is the engine’s internal cache metric and may diverge from wafer_cache_hit_ratio_last_24h, which is computed from billed token counts.
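Because stale engine series drop out of the scrape instead of reporting zeros, client code should treat "absent" and "zero" as different states. A minimal sketch of that distinction, assuming the scrape has already been reduced to a name-to-value mapping (the `scrape` values here are made up):

```python
from typing import Dict, Optional

ENGINE_GAUGES = (
    "wafer_engine_running_requests",
    "wafer_engine_queue_depth",
    "wafer_engine_kv_cache_usage_ratio",
    "wafer_engine_kv_cache_hit_ratio",
)


def engine_state(scrape: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Map each engine gauge to its value, or None when the series is absent."""
    return {name: scrape.get(name) for name in ENGINE_GAUGES}


# Hypothetical scrape in which the KV cache series are missing.
scrape = {"wafer_engine_running_requests": 4.0, "wafer_engine_queue_depth": 0.0}
state = engine_state(scrape)

queue_is_empty = state["wafer_engine_queue_depth"] == 0.0             # reported, and zero
kv_usage_unknown = state["wafer_engine_kv_cache_usage_ratio"] is None  # not reported at all
```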

Activity

  • wafer_concurrent_requests — distinct requests in flight in the last 2 min. Coarse “is the endpoint busy” gauge.
  • wafer_metrics_window_seconds — constant 86400, exposed so dashboards can label tiles correctly if we ever change the aggregation window.

Prometheus Config

scrape_configs:
  - job_name: wafer
    scrape_interval: 60s
    metrics_path: /v1/endpoints/metrics/prometheus
    scheme: https
    params:
      endpoint: ["<ENDPOINT_HOST>"]
    authorization:
      type: Bearer
      credentials: <API_KEY>
    static_configs:
      - targets: ["api.wafer.ai"]
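For multiple endpoints, repeat the scrape config with a distinct job_name and the matching endpoint value. Rate limits are tracked per API key, so endpoints scraped with separate keys each get an independent rate-limit bucket.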
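If you manage many endpoints, the repeated blocks can be generated rather than hand-copied. The sketch below renders one scrape job per endpoint from the config shown above. The `wafer-<first host label>` job naming, the example hosts, and the one-key-per-endpoint mapping are all assumptions made for illustration, and keys containing YAML-special characters would need quoting.

```python
from typing import Dict

# One scrape job, mirroring the config above; {job}, {host}, {key} are filled in.
JOB_TEMPLATE = """\
  - job_name: {job}
    scrape_interval: 60s
    metrics_path: /v1/endpoints/metrics/prometheus
    scheme: https
    params:
      endpoint: ["{host}"]
    authorization:
      type: Bearer
      credentials: {key}
    static_configs:
      - targets: ["api.wafer.ai"]
"""


def render_scrape_configs(endpoints: Dict[str, str]) -> str:
    """Render a scrape_configs section from an endpoint-host -> API-key mapping."""
    jobs = []
    for host, key in endpoints.items():
        job = "wafer-" + host.split(".")[0]  # hypothetical naming scheme
        jobs.append(JOB_TEMPLATE.format(job=job, host=host, key=key))
    return "scrape_configs:\n" + "".join(jobs)


# Hypothetical endpoint hosts with placeholder keys.
rendered = render_scrape_configs({
    "endpoint-a.wafer.ai": "<API_KEY_A>",
    "endpoint-b.wafer.ai": "<API_KEY_B>",
})
```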

Grafana Cloud Agent

Same shape, nested under the agent’s prometheus.configs block, with the key in the authorization block:
prometheus:
  configs:
    - name: wafer
      scrape_configs:
        - job_name: wafer
          scrape_interval: 60s
          metrics_path: /v1/endpoints/metrics/prometheus
          scheme: https
          params:
            endpoint: ["<ENDPOINT_HOST>"]
          authorization:
            type: Bearer
            credentials: <API_KEY>
          static_configs:
            - targets: ["api.wafer.ai"]

OTel Collector

Use the Prometheus receiver:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: wafer
          scrape_interval: 60s
          metrics_path: /v1/endpoints/metrics/prometheus
          scheme: https
          params:
            endpoint: ["<ENDPOINT_HOST>"]
          authorization:
            type: Bearer
            credentials: <API_KEY>
          static_configs:
            - targets: ["api.wafer.ai"]
Forward to Datadog, New Relic, or any OTLP backend via the standard collector pipeline.

Datadog Agent

Use the openmetrics check:
init_config:

instances:
  - openmetrics_endpoint: "https://api.wafer.ai/v1/endpoints/metrics/prometheus?endpoint=<ENDPOINT_HOST>"
    namespace: wafer
    metrics:
      - "wafer_*"
    headers:
      Authorization: "Bearer <API_KEY>"
    min_collection_interval: 60

Why gauges instead of counters?

We expose _last_24h gauges instead of monotonic _total counters because the underlying log retention is finite — a “since-beginning-of-time” counter would silently start dropping events once they aged out, which is worse than an honestly-named rolling window. For an average request rate, divide by 86400; for spike detection, alert on the absolute value relative to your baseline.

Why no histogram buckets?

The percentile series (_seconds, _per_second) are gauges with a quantile label, not Prometheus histograms. The backing query returns precomputed percentiles directly; rendering full histograms would require 8–12 additional queries per scrape and wouldn’t fit the rate-limit budget. You can chart and alert on the existing quantiles; PromQL histogram_quantile() doesn’t apply.
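Since the percentiles arrive precomputed, alerting on them is a direct lookup by quantile label, with no bucket interpolation. A minimal sketch, using the two TTFT samples from the example response and a hypothetical 0.5s p90 budget:

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, Dict[str, str], float]


def quantile_value(samples: List[Sample], name: str, quantile: str) -> Optional[float]:
    """Return the gauge for one quantile label, or None if it is not present."""
    for metric, labels, value in samples:
        if metric == name and labels.get("quantile") == quantile:
            return value
    return None


# Values taken from the example response.
samples: List[Sample] = [
    ("wafer_ttft_seconds", {"endpoint": "example-endpoint.wafer.ai", "quantile": "0.5"}, 0.12),
    ("wafer_ttft_seconds", {"endpoint": "example-endpoint.wafer.ai", "quantile": "0.9"}, 0.25),
]

TTFT_P90_BUDGET_S = 0.5  # hypothetical alert threshold

p90 = quantile_value(samples, "wafer_ttft_seconds", "0.9")
p90_breached = p90 is not None and p90 > TTFT_P90_BUDGET_S
```

Note that the quantile label is matched as a string, exactly as it appears in the scrape output.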

Errors

  • 401: missing or invalid API key
  • 403: dedicated-only — serverless keys, or a dedicated key against an endpoint it doesn’t own
  • 429: rate limit (6/min per key) — back off and retry
  • 5xx: backing query failed. The engine-gauge query failing degrades gracefully (you’ll still get the 24h-window series); the heavy summary query failing surfaces as a 5xx so your Prometheus target shows as down rather than a misleading 200 with zeros.
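For custom scrapers, the status codes above suggest a simple policy: treat 401 and 403 as configuration errors that retrying will not fix, and back off on 429 and 5xx. The sketch below is one way to encode that; the function names, the 10s base delay (the spacing implied by 6 requests per minute), and the 60s cap are assumptions, not documented behavior.

```python
def scrape_action(status: int) -> str:
    """Classify an HTTP status as 'ok', 'retry', or 'fail'."""
    if status == 200:
        return "ok"
    if status in (401, 403):
        return "fail"  # bad key or wrong endpoint; retrying will not help
    if status == 429 or 500 <= status <= 599:
        return "retry"  # rate limited or backing query failed
    return "fail"


def backoff_seconds(attempt: int, base: float = 10.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt, capped at one scrape interval."""
    return min(cap, base * 2 ** attempt)


delays = [backoff_seconds(n) for n in range(4)]
```

Capping the delay at 60s keeps a recovering scraper aligned with the recommended scrape interval.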

Notes

  • The 24h window matches the dashboard’s default range, so values in PromQL line up with what you see in the Wafer UI.
  • Cardinality is intentionally bounded: endpoint (always), plus status_class (3 values) or quantile (3–4 values) on the partitioned series. No per-model or per-request labels in v1.
  • For per-request debugging, use Request Inspection. For the same data as JSON, use the Metrics API.