Use request inspection to debug failed or slow requests.

Save the Request ID

Inference responses include an x-request-id header.
x-request-id: <REQUEST_ID>

List Requests

curl -s "https://api.wafer.ai/v1/endpoints/requests?endpoint=<ENDPOINT_HOST>&limit=20&errors_only=true" \
  -H "Authorization: Bearer <API_KEY>"
Example response:
{
  "requests": [
    {
      "request_id": "<REQUEST_ID>",
      "status_code": 400,
      "model_requested": "<MODEL_ID>",
      "model_resolved": "<MODEL_ID>",
      "is_streaming": true,
      "ttft_ms": null,
      "total_latency_ms": 31,
      "input_tokens": 0,
      "output_tokens": 0,
      "cache_read_tokens": 0,
      "error_code": "invalid_json_request",
      "error_message": "Could not parse JSON body",
      "created_at": "<TIMESTAMP>"
    }
  ],
  "has_more": false,
  "cursor": null
}
The list endpoint returns a slim projection of each request. Use Get One Request for the full debug record on a specific ID.

Get One Request

curl -s "https://api.wafer.ai/v1/endpoints/requests/<REQUEST_ID>?endpoint=<ENDPOINT_HOST>" \
  -H "Authorization: Bearer <API_KEY>"
<REQUEST_ID> accepts either the 12-char x-request-id you saw in the response header or the matching UUID surfaced elsewhere in the dashboard.
Example response. The single-request endpoint returns the full per-request debug record, so you can investigate slow-TTFT or poorly cached requests without leaving the API:
{
  "request_id": "<REQUEST_ID>",
  "status_code": 200,
  "model_requested": "<MODEL_ID>",
  "model_resolved": "<MODEL_ID>",
  "is_streaming": true,
  "ttft_ms": 428,
  "total_latency_ms": 2413,
  "input_tokens": 1180,
  "output_tokens": 241,
  "cache_read_tokens": 960,
  "error_code": null,
  "error_message": null,
  "created_at": "<TIMESTAMP>",
  "output_tps": 99.8,
  "duration_ms": 2380,
  "finish_reason": "stop",
  "matched_stop_token_id": null,
  "stall_count": 0,
  "path": "/v1/chat/completions",
  "method": "POST",
  "temperature": 0.7,
  "top_p": 1.0,
  "tool_count": 0,
  "transport_status_code": 200,
  "stream_error_status_code": null,
  "stream_error_type": null,
  "usage_present": true,
  "usage_parse_error": null
}

Parameters (List)

  • endpoint (required): <ENDPOINT_HOST>
  • limit (optional, default 50): 1-200
  • cursor (optional): ISO-8601 created_at cursor for pagination (must include a timezone, e.g. 2026-05-22T14:55:00+00:00)
  • errors_only (optional): true or false. When true, returns only 4xx/5xx requests pulled directly from Axiom logs (ignores the tail-latency / cache filters below).
  • min_ttft_ms, max_ttft_ms (optional, integers): inclusive bounds on ttft_ms. Useful for finding the requests behind a TTFT p90/p99 spike.
  • min_total_latency_ms, max_total_latency_ms (optional, integers): inclusive bounds on total_latency_ms.
  • max_cache_hit_pct (optional, 0.0–100.0): only requests where cache_read_tokens / input_tokens * 100 <= max_cache_hit_pct. Combine with min_input_tokens to skip the tiny / error rows that would otherwise dominate the result.
  • min_input_tokens (optional, integer): only requests with input_tokens >= min_input_tokens.
  • tag (optional, repeatable): only requests carrying at least one of the given values. Matches against both the scalar request_tag (the first tag set on the request) and any value in the request_tags array (the full set when the customer sent multiple). Repeat the param to OR multiple filter values (?tag=prewarm&tag=eval). Up to 20 filter values per query; each value must match ^[a-z0-9_-]{1,32}$. See Request Tagging for how to set tags on the inference side.
Filters AND-combine. Helpful patterns:
  • ?min_ttft_ms=2000&limit=20 — the recent slow-start requests behind a TTFT p99 alert.
  • ?max_cache_hit_pct=20&min_input_tokens=1000 — the low-cache-hit tail hidden behind a high aggregate hit %.
  • ?min_total_latency_ms=5000&max_cache_hit_pct=50 — slow AND poorly cached, the typical signature of cold-prefix requests.
  • ?tag=prewarm — only your customer-tagged prewarm requests. Combine with the filters above to slice analysis by traffic class.
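The parameters above can be assembled programmatically. The following is a minimal sketch, not an official client: build_list_url is a hypothetical helper name, and the client-side checks simply mirror the documented limits (limit 1-200, max_cache_hit_pct 0-100, up to 20 tags matching the tag regex). The API remains the authority on validation.

```python
import re
from urllib.parse import urlencode

API_BASE = "https://api.wafer.ai/v1/endpoints/requests"
TAG_RE = re.compile(r"[a-z0-9_-]{1,32}")


def build_list_url(endpoint, *, limit=None, tags=(), **filters):
    """Build a List Requests URL. `filters` are passed through by name,
    e.g. min_ttft_ms=2000 or errors_only=True (hypothetical helper)."""
    if limit is not None and not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    if len(tags) > 20:
        raise ValueError("at most 20 tag values per query")
    for tag in tags:
        if not TAG_RE.fullmatch(tag):
            raise ValueError(f"invalid tag: {tag!r}")
    pct = filters.get("max_cache_hit_pct")
    if pct is not None and not 0 <= pct <= 100:
        raise ValueError("max_cache_hit_pct must be between 0 and 100")

    params = [("endpoint", endpoint)]
    if limit is not None:
        params.append(("limit", limit))
    for name, value in filters.items():
        if value is None:
            continue
        # The docs show booleans as lowercase true/false.
        if isinstance(value, bool):
            value = str(value).lower()
        params.append((name, value))
    # Repeating the tag param ORs the values together.
    params += [("tag", t) for t in tags]
    return f"{API_BASE}?{urlencode(params)}"
```

For example, build_list_url("flip.wafer.ai", limit=20, min_ttft_ms=2000) produces the slow-start query from the first pattern above. Because the filters AND-combine, adding more keyword arguments narrows the result.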

Key Fields

Always present on both list and single-request responses:
  • request_id: stable request ID (12-char hex x-request-id, or the UUID for newer rows)
  • status_code: final HTTP status code
  • model_requested, model_resolved: requested and resolved model IDs
  • is_streaming: true for SSE / chunked-response requests
  • ttft_ms: time to first token (streaming only; null otherwise)
  • total_latency_ms: end-to-end wall latency
  • input_tokens, output_tokens, cache_read_tokens: token counts
  • error_code: Wafer error code when available (see the Error Reference)
  • error_message: human-readable error description when available
  • created_at: UTC timestamp
  • request_tag: the first Wafer-Request-Tag header set on the inference call, or null if you didn’t tag the request. Single-tag callers can keep reading just this field.
  • request_tags: the full ordered list of Wafer-Request-Tag header values set on the call (1–8 entries), or null if untagged. When you only sent one tag, this is a one-element list containing the same value as request_tag. See Request Tagging.
Additional fields, returned only by GET /requests/{id} (the list endpoint omits these to keep paginated responses small; take the request_id from the list, then call GET /requests/{id} for the rest):
  • output_tps: generation speed for this request, as opposed to the bin-aggregated TPS in Metrics.
  • duration_ms: upstream/inference time. A large total_latency_ms − duration_ms gap indicates queueing or transport overhead.
  • finish_reason and matched_stop_token_id: distinguish a natural stop from length / tool_calls / abrupt cutoff.
  • stall_count: intra-stream pauses longer than 1s between tokens.
  • path, method: the HTTP route actually invoked.
  • temperature, top_p, tool_count: request-body characteristics. Use these to answer “is this slow request a tool-heavy / high-temp call?”
  • transport_status_code, stream_error_status_code, stream_error_type: error / quality signals. Mostly null on happy-path 2xx requests. Useful when the stream broke mid-response.
  • usage_present, usage_parse_error: usage_present is false and usage_parse_error is non-null when wafer-edge couldn’t parse a usage payload back from the upstream.
Worker-routing internals (backend_id, winner_backend_id, node_name, inference_engine) are intentionally not exposed.

Pagination

If has_more is true, pass the returned cursor value into the next request:
curl -s "https://api.wafer.ai/v1/endpoints/requests?endpoint=<ENDPOINT_HOST>&limit=50&cursor=<CURSOR>" \
  -H "Authorization: Bearer <API_KEY>"
cursor is a created_at ISO-8601 timestamp (with timezone); the API serves rows strictly older than the cursor.
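A pagination loop might look like the sketch below. It assumes a caller-supplied fetch_page(cursor) function (hypothetical, not part of any Wafer SDK) that performs the GET shown above and returns the decoded JSON body. The max_pages guard is also an addition for safety, not an API feature.

```python
from typing import Callable, Iterator, Optional


def iter_requests(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield request rows page by page, newest first, until has_more is false.

    fetch_page(None) should fetch the first page; fetch_page(cursor) should
    pass the cursor through as the `cursor` query parameter.
    """
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page["requests"]
        if not page.get("has_more"):
            return
        # Each cursor is a created_at timestamp; the next page holds
        # rows strictly older than it.
        cursor = page["cursor"]
```

If you build the URL by hand, percent-encode the cursor: a raw + in a timezone offset such as +00:00 is commonly decoded as a space in query strings, which would likely produce a 422.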

Error-Only Mode

Use errors_only=true to list only 4xx and 5xx requests. This path queries Axiom logs directly so it can surface error requests that never made it into the request table (e.g. JSON-parse failures, body-too-large rejections). The tail-latency and cache filters above are not applied in errors_only mode.

Errors

  • 401: missing or invalid API key
  • 403: API key does not have access to the requested endpoint
  • 404: request ID was not found for that endpoint
  • 422: invalid cursor format, missing timezone on the cursor, or invalid request_id format (must be a UUID or 12-char hex x-request-id)
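To avoid a 422 round trip, you can check the request ID format before calling the API. This is a sketch under assumptions: is_valid_request_id is a hypothetical helper, the UUID is assumed to be in the canonical hyphenated 8-4-4-4-12 form, and hex digits are assumed to be accepted in either case. The API may be stricter or more lenient.

```python
import re

# 12-char hex x-request-id, as returned in the response header.
HEX12_RE = re.compile(r"[0-9a-f]{12}", re.IGNORECASE)
# Canonical hyphenated UUID (assumed form).
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_valid_request_id(value: str) -> bool:
    """Return True if value looks like a 12-char hex ID or a UUID."""
    return bool(HEX12_RE.fullmatch(value) or UUID_RE.fullmatch(value))
```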

Request Tagging

Attach one or more short customer-defined labels to any inference request by setting Wafer-Request-Tag on the call to /v1/chat/completions, /v1/messages, or /v1/completions. Tags are recorded with the request and surface in three places: the request_tag (first tag) and request_tags (full list) fields in the responses above, the ?tag=<value> filter on this endpoint, and the tag chip picker on the dedicated-endpoint dashboard.
A single tag (what most customers send):
curl -sS "https://flip.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -H "Wafer-Request-Tag: prewarm" \
  -d '{"model": "GLM-5.1", "messages": [...]}'
Multiple tags — repeat the header. Use this when a single request fits more than one slice you care about (e.g. it’s a prewarm AND it’s for your e-commerce vertical):
curl -sS "https://flip.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -H "Wafer-Request-Tag: prewarm" \
  -H "Wafer-Request-Tag: ecomm" \
  -H "Wafer-Request-Tag: healthcare" \
  -d '{"model": "GLM-5.1", "messages": [...]}'
Multiple tags — comma-separated in one header. Equivalent to the repeated-header form; pick whichever your HTTP client makes easier:
curl -sS "https://flip.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -H "Wafer-Request-Tag: prewarm,ecomm,healthcare" \
  -d '{"model": "GLM-5.1", "messages": [...]}'
Both shapes can be mixed in a single request (Wafer-Request-Tag: prewarm,ecomm + Wafer-Request-Tag: healthcare); the deduped union is what gets stored.
Common tag values customers use:
  • prewarm — cache-warming pings that don’t represent real user load.
  • health-check — synthetic uptime probes from your own infra.
  • eval / eval-run-<id> — offline quality evaluations you want to separate from production traffic.
  • synthetic — anything your client treats as not-from-a-real-user.
  • Vertical / tenant labels (ecomm, healthcare, tenant-acme) — when one fleet serves multiple customer segments and you want to slice the dashboard by who’s driving load.
The values are entirely up to you; Wafer just stores and filters on them.

Header rules

  • Format: ^[a-z0-9_-]{1,32}$ per value — lowercase ASCII, digits, dashes, underscores. 1–32 characters.
  • Two equivalent ways to send multiple tags: repeat the header (Wafer-Request-Tag: prewarm + Wafer-Request-Tag: ecomm) or comma-separate in one header (Wafer-Request-Tag: prewarm,ecomm). Whitespace around commas is fine (prewarm, ecomm). The two shapes can be mixed in a single request.
  • Up to 8 tags per request. Repeated values are deduped (sending prewarm twice on the same request — via repeated headers or a comma list — stores prewarm once); order is preserved.
  • Empty parts (trailing comma, double comma) are rejected as malformed so a typo like prewarm,ecomm, doesn’t silently drop the third tag.
  • Malformed values are rejected with 400 invalid_request_tag so silent drift between what you send and what Wafer stores is impossible. The error message identifies the offending value when multiple tags are sent.
  • The header is dedicated-endpoint only. If your key is a Wafer Serverless key (used on pass.wafer.ai), the header is silently ignored — sending it from a generic client won’t fail your request, but it also won’t show up in the dashboard.
  • Tags are per-request. Untagged calls store null for both request_tag and request_tags, and are excluded by any ?tag= filter.
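The header rules above can be mirrored client-side to catch malformed tags before a request is sent. This is an illustrative sketch, not Wafer's server-side implementation: parse_request_tags is a hypothetical name, and two behaviors are assumptions because the rules do not spell them out, namely that the 8-tag limit is counted after deduplication and that exceeding it is an error.

```python
import re

TAG_RE = re.compile(r"[a-z0-9_-]{1,32}")
MAX_TAGS = 8


def parse_request_tags(header_values):
    """Combine one or more Wafer-Request-Tag header values into an
    ordered, deduplicated tag list, raising ValueError on bad input."""
    tags = []
    for raw in header_values:
        for part in raw.split(","):
            tag = part.strip()  # whitespace around commas is allowed
            # An empty part (trailing or double comma) fails the regex too.
            if not TAG_RE.fullmatch(tag):
                raise ValueError(f"invalid_request_tag: {tag!r}")
            if tag not in tags:  # dedupe while preserving order
                tags.append(tag)
    # Assumption: the limit applies to the deduplicated set.
    if len(tags) > MAX_TAGS:
        raise ValueError(f"at most {MAX_TAGS} tags per request")
    return tags
```

For example, parse_request_tags(["prewarm,ecomm", "healthcare"]) returns ["prewarm", "ecomm", "healthcare"], matching the mixed-shape case described above.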

Filtering by tag

Once you’re tagging traffic, use the tag query param on GET /v1/endpoints/requests:
# Only the prewarm pings:
curl -s "https://api.wafer.ai/v1/endpoints/requests?endpoint=flip.wafer.ai&tag=prewarm&limit=50" \
  -H "Authorization: Bearer <API_KEY>"

# Prewarms OR health-checks:
curl -s "https://api.wafer.ai/v1/endpoints/requests?endpoint=flip.wafer.ai&tag=prewarm&tag=health-check" \
  -H "Authorization: Bearer <API_KEY>"
  • Up to 20 tag values per query.
  • Each value validated against the same regex; an invalid tag in the filter returns 422.
  • The filter is an OR-match against both request_tag and request_tags: a request matches if any of its stored tags is in the filter set. So ?tag=ecomm returns both single-tag rows where request_tag = "ecomm" and multi-tag rows where "ecomm" appears anywhere in request_tags.
  • To exclude a tag (e.g. “everything except prewarms”), the API doesn’t have a negative match yet — post-filter on request_tags (or request_tag for single-tag callers) client-side. If first-class exclude is useful for you, let us know.
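Client-side exclusion can be done on the rows returned by the list endpoint. A minimal sketch, assuming rows are the decoded dicts from the requests array; stored_tags and exclude_tags are hypothetical helper names.

```python
def stored_tags(row):
    """Return the tags stored on a row, preferring request_tags and
    falling back to the scalar request_tag; untagged rows yield []."""
    tags = row.get("request_tags")
    if tags:
        return list(tags)
    tag = row.get("request_tag")
    return [tag] if tag else []


def exclude_tags(rows, excluded):
    """Keep only rows that carry none of the excluded tags.
    Untagged rows are kept."""
    excluded = set(excluded)
    return [row for row in rows if excluded.isdisjoint(stored_tags(row))]
```

Calling exclude_tags(rows, ["prewarm"]) gives "everything except prewarms", including untagged traffic.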

Common Workflows

Tail-latency investigation

Start in Metrics and chase the spike down to a single bad request:
  1. GET /v1/endpoints/metrics?endpoint=<HOST>&range_minutes=60 — confirm ttft_p95_ms (or ttft_p99_ms) is elevated.
  2. GET /v1/endpoints/requests?endpoint=<HOST>&min_ttft_ms=<p95_value>&limit=20 — pull the rows behind the spike.
  3. GET /v1/endpoints/requests/<REQUEST_ID>?endpoint=<HOST> — inspect a single offender. The gap between total_latency_ms and duration_ms separates queueing/transport overhead from upstream slowness; stall_count > 0 means the stream paused mid-response; output_tps shows the per-request generation rate.
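The checks in step 3 can be expressed as a small helper over the single-request response. This is a sketch: latency_breakdown and its output keys are hypothetical names, and the input is assumed to be the decoded JSON from GET /requests/{id}.

```python
def latency_breakdown(record):
    """Summarize where time went for one request record."""
    total = record.get("total_latency_ms")
    upstream = record.get("duration_ms")
    # Queueing/transport overhead is the part of wall time not spent upstream.
    overhead = None if total is None or upstream is None else total - upstream
    return {
        "overhead_ms": overhead,
        "stalled": (record.get("stall_count") or 0) > 0,
        "output_tps": record.get("output_tps"),
    }
```

For the example response shown under Get One Request (total_latency_ms 2413, duration_ms 2380, stall_count 0), the overhead is 33 ms and stalled is false, so the time was spent upstream rather than in queueing or transport. On older rows where total_latency_ms falls back to duration_ms, the overhead may read as 0 and carry no signal.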

Low-cache-hit large requests

When cache_hit_pct looks healthy in aggregate but you want the specific large requests that missed cache — to figure out what’s varying in the prompt prefix (a datetime, a session ID, a per-user header) — combine max_cache_hit_pct with min_input_tokens:
  1. GET /v1/endpoints/metrics?endpoint=<HOST>&range_minutes=360 — confirm aggregate cache_hit_pct is healthy. The interesting requests are the tail dragging it down.
  2. GET /v1/endpoints/requests?endpoint=<HOST>&max_cache_hit_pct=5&min_input_tokens=20000&limit=50 — surface large prompts that hit ≤5% cache. Tune the two thresholds to your traffic:
    • max_cache_hit_pct low (say 5 or 10) isolates the genuinely cache-missing requests.
    • min_input_tokens large (say 20000+) skips small requests — short chat turns, prewarm pings, and anything else that wouldn’t carry a reusable prefix anyway. This is a size proxy. If you want to label your own prewarm traffic explicitly so you can include or exclude it precisely, send a request tag on the inference call and filter with ?tag= here.
  3. For each offender, GET /requests/{id} and compare input_tokens vs cache_read_tokens. The difference (input_tokens − cache_read_tokens) is the fresh prompt content: the tokens that had to be re-encoded from scratch this turn. A large fresh count on a prompt that should have a stable prefix usually points at:
    • A dynamic value injected near the top of the prompt (timestamps, request IDs, randomized example ordering).
    • A new system prompt or rotated context.
    • A prefix that was evicted from the cache before this request arrived.
The dashboard’s per-request export carries this same shape — input_tokens, cache_read_tokens, and the derived fresh = input_tokens − cache_read_tokens and hit_pct columns — so you can pull the offender list locally and diff prompts side-by-side.
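The derived columns are straightforward to recompute from API rows. A minimal sketch, assuming each row is a decoded request dict; cache_stats is a hypothetical name, and returning None for hit_pct when input_tokens is 0 is an assumption made here to avoid dividing by zero.

```python
def cache_stats(row):
    """Compute fresh tokens and cache hit percentage for one request row."""
    input_tokens = row["input_tokens"]
    cached = row["cache_read_tokens"]
    fresh = input_tokens - cached
    # Same ratio the max_cache_hit_pct filter uses.
    hit_pct = cached / input_tokens * 100 if input_tokens else None
    return {"fresh": fresh, "hit_pct": hit_pct}
```

For the example response under Get One Request (input_tokens 1180, cache_read_tokens 960), fresh is 220 and hit_pct is roughly 81.4. Sorting offenders by fresh in descending order puts the most expensive cache misses first.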

Filter by your own tags

If you’ve tagged your inference calls with Wafer-Request-Tag (see Request Tagging), use ?tag= to scope the request list:
  • ?tag=prewarm&limit=20 — only the requests your client labelled as prewarms.
  • ?tag=prewarm&tag=health-check — multiple tags OR together (returns rows matching either).
  • Combine with the other filters to slice your analysis: ?tag=prewarm&max_cache_hit_pct=5&min_input_tokens=20000 will surface large prewarm prompts that missed cache, in case your prewarm payloads need re-tuning.
To exclude a tag, omit tag= and instead post-filter client-side on the returned request_tags field — the API doesn’t expose a negative match yet. If that’s important to your workflow, let us know.

Streaming health check

is_streaming = true with ttft_ms = null on the list endpoint means the stream errored before the first token. Use errors_only=true to pull those rows, then check stream_error_status_code and stream_error_type on the single-request response to see where the upstream broke.
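The list-side check can be sketched as a filter over returned rows; failed_before_first_token is a hypothetical helper name, and rows are assumed to be the decoded dicts from the requests array.

```python
def failed_before_first_token(rows):
    """Return streaming rows that never produced a first token."""
    return [
        row
        for row in rows
        if row.get("is_streaming") and row.get("ttft_ms") is None
    ]
```

Feed each returned request_id to the single-request endpoint to read stream_error_status_code and stream_error_type.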

Notes

  • Request lookup is endpoint-scoped. A valid request ID for another endpoint returns 404.
  • Use x-request-id for exact correlation between your application logs and Wafer’s logs.
  • For older rows that lack the edge-to-edge wall-time measurement, GET /requests/{id} falls back to duration_ms when reporting total_latency_ms.