Prometheus Scrape

Wafer exposes a per-endpoint Prometheus scrape URL so you can wire your existing Grafana / Datadog / New Relic / OTel Collector setup to a dedicated endpoint without going through the JSON Metrics API.

Run

curl -s "https://api.wafer.ai/v1/endpoints/metrics/prometheus?endpoint=<ENDPOINT_HOST>" \
  -H "Authorization: Bearer <API_KEY>"

Response is the standard Prometheus 0.0.4 text-exposition format:

# HELP wafer_requests_last_24h Requests served in the last 24h, partitioned by HTTP status class. Rolling window — divide by 86400 for an average rate, or alert on absolute counts.
# TYPE wafer_requests_last_24h gauge
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="2xx"} 12000
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="4xx"} 300
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="5xx"} 45
# HELP wafer_ttft_seconds Time to first token, in seconds. Reported as quantiles (0.5 / 0.9 / 0.95 / 0.99) over the last 24h. Streaming requests only.
# TYPE wafer_ttft_seconds gauge
wafer_ttft_seconds{endpoint="example-endpoint.wafer.ai",quantile="0.5"} 0.12
wafer_ttft_seconds{endpoint="example-endpoint.wafer.ai",quantile="0.9"} 0.25
...
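
If you want to consume the scrape output from a script rather than a Prometheus-compatible agent, the sample lines can be read with a small parser. The following is a minimal sketch, not an official client: `parse_samples` and the two regexes are illustrative names, and the sketch assumes label values contain no escaped quotes that need unescaping.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels_dict, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line; ignore it
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


body = """\
# TYPE wafer_requests_last_24h gauge
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="2xx"} 12000
wafer_requests_last_24h{endpoint="example-endpoint.wafer.ai",status_class="5xx"} 45
"""
samples = parse_samples(body)
```

Each entry in `samples` carries the metric name, its labels as a dict, and the value as a float, which is enough to filter by `status_class` or `quantile`.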

Auth

Authorization: Bearer <API_KEY> — the same dedicated-endpoint key you use for inference. Each key gets its own rate-limit bucket, so a single Grafana agent can scrape multiple endpoints from one egress IP without contending for buckets.

Scrape Interval & Rate Limit

Recommended scrape interval: 60s
Rate limit: 6 req/min per API key (HTTP 429 on overrun)

The backing data refreshes once per minute, so scraping more often than every 60s won’t surface new data; it just burns through your rate budget. The 6 req/min limit itself is reached at one scrape every 10s.
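
To sanity-check a candidate scrape interval against these two numbers, the arithmetic can be sketched as below. This is an illustrative helper, not part of any Wafer tooling; `check_interval` and its result keys are made-up names, and the constants are taken from the limits stated above.

```python
RATE_LIMIT_PER_MIN = 6   # from the text: 6 req/min per API key
REFRESH_SECONDS = 60     # from the text: backing data refreshes once per minute


def check_interval(interval_seconds):
    """Summarise how one scrape job at this interval fits the stated limits."""
    requests_per_min = 60 / interval_seconds
    return {
        "requests_per_min": requests_per_min,
        "within_rate_limit": requests_per_min <= RATE_LIMIT_PER_MIN,
        # Scrapes closer together than the refresh period re-read the same data.
        "every_scrape_is_fresh": interval_seconds >= REFRESH_SECONDS,
    }


recommended = check_interval(60)
at_the_limit = check_interval(10)
too_fast = check_interval(5)
```

At 60s a job uses one sixth of the budget and every scrape can see refreshed data; at 10s it sits exactly on the limit without gaining anything; at 5s it would start receiving 429s.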

Available

Dedicated endpoints only.
Serverless (pass.wafer.ai) keys return 403 — the shared-fleet metrics aren’t meaningfully scoped to a single customer’s key. Use the Billing API for serverless usage.

Metrics

All series carry an endpoint="<your-endpoint>" label. The _last_24h suffix marks a 24h-rolling gauge, not a true monotonic counter: divide by 86400 for an average rate, or alert on absolute counts.

Request volume

wafer_requests_last_24h{status_class} — counts by 2xx / 4xx / 5xx
wafer_input_tokens_last_24h — total input tokens billed
wafer_output_tokens_last_24h — total output tokens billed
wafer_cache_read_tokens_last_24h — input tokens served from prompt cache
wafer_cache_hit_ratio_last_24h — 0.0–1.0, cache_read / input_tokens
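
As a worked example of the "divide by 86400" guidance, the sketch below uses the request counts from the sample response earlier on this page. The token counts passed to `cache_hit_ratio` are invented for illustration, and returning `None` for zero input tokens is an assumption of this sketch, not documented behaviour of the endpoint.

```python
WINDOW_SECONDS = 86400  # the 24h rolling window, in seconds

# Counts taken from the sample response on this page.
requests_last_24h = {"2xx": 12000, "4xx": 300, "5xx": 45}

total_requests = sum(requests_last_24h.values())
avg_requests_per_second = total_requests / WINDOW_SECONDS
server_error_ratio = requests_last_24h["5xx"] / total_requests


def cache_hit_ratio(cache_read_tokens, input_tokens):
    """cache_read / input_tokens, as described for wafer_cache_hit_ratio_last_24h."""
    if input_tokens == 0:
        return None  # assumption: treat the ratio as undefined with no traffic
    return cache_read_tokens / input_tokens


# Hypothetical token counts, for illustration only.
example_ratio = cache_hit_ratio(2_000_000, 8_000_000)
```

With these numbers the endpoint served 12,345 requests in the window, which averages out to roughly 0.14 requests per second, with about 0.36% of them in the 5xx class.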

Latency (24h-window percentile gauges)

wafer_ttft_seconds{quantile} — 0.5 / 0.9 / 0.95 / 0.99. Streaming requests only.
wafer_request_duration_seconds{quantile} — 0.5 / 0.9 / 0.99. End-to-end wall time.
wafer_output_tokens_per_second{quantile} — 0.1 / 0.5 / 0.9 / 0.99. Per-request output throughput.

Live engine state

These reflect the most recent ~1 min from the worker fleet. If a worker stopped reporting more than 10 min ago, the series drops out of the scrape (rather than continuing to show stale state).

wafer_engine_running_requests — in-flight requests, summed across replicas. Equivalent to sglang’s num_running_reqs.
wafer_engine_queue_depth — queued requests waiting for engine admission, summed across replicas.
wafer_engine_kv_cache_usage_ratio — KV cache utilization, averaged across replicas. 0.0–1.0.
wafer_engine_kv_cache_hit_ratio — engine-side KV cache hit rate. 0.0–1.0. Note: this is the engine’s internal cache metric and may diverge from wafer_cache_hit_ratio_last_24h, which is computed from billed token counts.
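
Because stale engine series drop out of the scrape instead of reporting old values, client code should distinguish "series absent" from "value is zero". A minimal sketch, assuming the scrape has already been parsed into `(name, labels, value)` tuples (a hypothetical in-memory shape; `engine_gauge` and `describe_queue` are illustrative names):

```python
def engine_gauge(samples, name):
    """Return the gauge value, or None when the series is absent from the scrape."""
    for sample_name, _labels, value in samples:
        if sample_name == name:
            return value
    return None


def describe_queue(samples):
    """Report queue depth without mistaking a missing series for an empty queue."""
    depth = engine_gauge(samples, "wafer_engine_queue_depth")
    if depth is None:
        return "no recent engine data"
    return f"queue depth {depth:g}"


fresh = [("wafer_engine_queue_depth", {"endpoint": "example-endpoint.wafer.ai"}, 3.0)]
stale = []  # the series dropped out of the scrape

fresh_status = describe_queue(fresh)
stale_status = describe_queue(stale)
```

The same idea applies to dashboards: a gap in an engine series means no worker has reported recently, not that the queue is empty.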

Activity

wafer_concurrent_requests — distinct requests in flight in the last 2 min. Coarse “is the endpoint busy” gauge.
wafer_metrics_window_seconds — constant 86400, exposed so dashboards can label tiles correctly if we ever change the aggregation window.

Prometheus Config

scrape_configs:
  - job_name: wafer
    scrape_interval: 60s
    metrics_path: /v1/endpoints/metrics/prometheus
    scheme: https
    params:
      endpoint: ["<ENDPOINT_HOST>"]
    authorization:
      type: Bearer
      credentials: <API_KEY>
    static_configs:
      - targets: ["api.wafer.ai"]

For multiple endpoints, repeat the scrape config with each endpoint’s host and key. Rate limits are per API key, so each key gets an independent rate-limit bucket.

Grafana Cloud Agent

Same shape, with the key in the authorization block:

prometheus:
  configs:
    - name: wafer
      scrape_configs:
        - job_name: wafer
          scrape_interval: 60s
          metrics_path: /v1/endpoints/metrics/prometheus
          scheme: https
          params:
            endpoint: ["<ENDPOINT_HOST>"]
          authorization:
            type: Bearer
            credentials: <API_KEY>
          static_configs:
            - targets: ["api.wafer.ai"]

OTel Collector

Use the Prometheus receiver:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: wafer
          scrape_interval: 60s
          metrics_path: /v1/endpoints/metrics/prometheus
          scheme: https
          params:
            endpoint: ["<ENDPOINT_HOST>"]
          authorization:
            type: Bearer
            credentials: <API_KEY>
          static_configs:
            - targets: ["api.wafer.ai"]

Forward to Datadog, New Relic, or any OTLP backend via the standard collector pipeline.

Datadog Agent

Use the openmetrics check:

init_config:

instances:
  - openmetrics_endpoint: "https://api.wafer.ai/v1/endpoints/metrics/prometheus?endpoint=<ENDPOINT_HOST>"
    namespace: wafer
    metrics:
      - "wafer_*"
    headers:
      Authorization: "Bearer <API_KEY>"
    min_collection_interval: 60

Why gauges instead of counters?

We expose _last_24h gauges instead of monotonic _total counters because the underlying log retention is finite — a “since-beginning-of-time” counter would silently start dropping events once they aged out, which is worse than an honestly-named rolling window. For an average request rate, divide by 86400; for spike detection, alert on the absolute value relative to your baseline.

Why no histogram buckets?

The percentile series (_seconds, _per_second) are gauges with a quantile label, not Prometheus histograms. The backing query returns precomputed percentiles directly; rendering full histograms would require 8–12 additional queries per scrape and wouldn’t fit the rate-limit budget. You can chart and alert on the existing quantiles; PromQL histogram_quantile() doesn’t apply.

Errors

401: missing or invalid API key
403: dedicated-only — serverless keys, or a dedicated key against an endpoint it doesn’t own
429: rate limit (6/min per key) — back off and retry
5xx: backing query failed. The engine-gauge query failing degrades gracefully (you’ll still get the 24h-window series); the heavy summary query failing surfaces as a 5xx so your Prometheus target shows as down rather than a misleading 200 with zeros.
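
For a custom scraper, the status codes above map onto a small decision table. The sketch below is one possible client-side policy, not behaviour prescribed by the API: the action names are invented, and the exponential backoff schedule (10s base, 60s cap) is an assumption derived from the 6/min limit, since the text only says to back off and retry.

```python
def scrape_action(status):
    """Map an HTTP status from the scrape URL to a coarse client-side action."""
    if status == 200:
        return "ingest"
    if status in (401, 403):
        # Bad key, serverless key, or an endpoint the key doesn't own:
        # retrying without changing the config will not help.
        return "fix_config"
    if status == 429:
        return "back_off"
    if 500 <= status <= 599:
        # Treat the target as down and try again on the next scheduled scrape.
        return "mark_down"
    return "unexpected"


def backoff_seconds(attempt, base=10.0, cap=60.0):
    """Exponential backoff after a 429; base and cap are assumptions of this sketch."""
    return min(cap, base * 2 ** attempt)


delays = [backoff_seconds(attempt) for attempt in range(4)]
```

Here `delays` grows from 10s up to the 60s cap, so a rate-limited client settles back to the recommended scrape interval rather than retrying immediately.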

Notes

The 24h window matches the dashboard’s default range, so values in PromQL line up with what you see in the Wafer UI.
Cardinality is intentionally bounded: endpoint (always), plus status_class (3 values) or quantile (3–4 values) on the partitioned series. No per-model or per-request labels in v1.
For per-request debugging, use Request Inspection. For the same data as JSON, use the Metrics API.
