Save the Request ID
Inference responses include an x-request-id header.
List Requests
Get One Request
<REQUEST_ID> accepts either the 12-char x-request-id you saw in the response header or the matching UUID surfaced elsewhere in the dashboard.
Example response — the single-request endpoint surfaces the full per-request debug record so you can investigate slow-TTFT / bad-cache requests without leaving the API:
Parameters (List)
- endpoint (required): <ENDPOINT_HOST>
- limit (optional, default 50): 1-200
- cursor (optional): ISO-8601 created_at cursor for pagination (must include a timezone, e.g. 2026-05-22T14:55:00+00:00)
- errors_only (optional): true or false. When true, returns only 4xx/5xx requests pulled directly from Axiom logs (ignores the tail-latency / cache filters below).
- min_ttft_ms, max_ttft_ms (optional, integers): inclusive bounds on ttft_ms. Useful for finding the requests behind a TTFT p90/p99 spike.
- min_total_latency_ms, max_total_latency_ms (optional, integers): inclusive bounds on total_latency_ms.
- max_cache_hit_pct (optional, 0.0–100.0): only requests where cache_read_tokens / input_tokens * 100 <= max_cache_hit_pct. Combine with min_input_tokens to skip the tiny / error rows that would otherwise dominate the result.
- min_input_tokens (optional, integer): only requests with input_tokens >= min_input_tokens.
- tag (optional, repeatable): only requests carrying at least one of the given values. Matches against both the scalar request_tag (the first tag set on the request) and any value in the request_tags array (the full set when the customer sent multiple). Repeat the param to OR multiple filter values (?tag=prewarm&tag=eval). Up to 20 filter values per query; each value must match ^[a-z0-9_-]{1,32}$. See Request Tagging for how to set tags on the inference side.
- ?min_ttft_ms=2000&limit=20: the recent slow-start requests behind a TTFT p99 alert.
- ?max_cache_hit_pct=20&min_input_tokens=1000: the low-cache-hit tail hidden behind a high aggregate hit %.
- ?min_total_latency_ms=5000&max_cache_hit_pct=50: slow AND poorly cached, the typical signature of cold-prefix requests.
- ?tag=prewarm: only your customer-tagged prewarm requests. Combine with the filters above to slice analysis by traffic class.
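The filter combinations above can be assembled programmatically. The following is a minimal sketch, not an official client: `BASE_URL` is a placeholder host, and `build_list_url` and `to_cursor` are hypothetical helper names. Only the `/v1/endpoints/requests` path and the parameter names come from this page.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Placeholder host (an assumption); substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_list_url(endpoint, tags=(), **filters):
    """Build a GET /v1/endpoints/requests URL from filter values."""
    params = [("endpoint", endpoint)]
    # Skip filters left as None so only explicit bounds are sent.
    params += [(k, v) for k, v in filters.items() if v is not None]
    # Repeating the tag param ORs the filter values together.
    params += [("tag", t) for t in tags]
    return f"{BASE_URL}/v1/endpoints/requests?{urlencode(params)}"


def to_cursor(dt):
    """Format a datetime as an ISO-8601 cursor; a timezone is required."""
    if dt.tzinfo is None:
        raise ValueError("cursor must include a timezone")
    return dt.isoformat()


url = build_list_url("my-endpoint.example.com", tags=["prewarm", "eval"],
                     min_ttft_ms=2000, limit=20)
print(url)
```

Passing the cursor through `urlencode` matters because the `+` in a timezone offset such as `+00:00` must be percent-encoded in a query string.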
Key Fields
Always present on both list and single-request responses:
- request_id: stable request ID (12-char hex x-request-id, or the UUID for newer rows)
- status_code: final HTTP status code
- model_requested, model_resolved: requested and resolved model IDs
- is_streaming: true for SSE / chunked-response requests
- ttft_ms: time to first token (streaming only; null otherwise)
- total_latency_ms: end-to-end wall latency
- input_tokens, output_tokens, cache_read_tokens: token counts
- error_code: Wafer error code when available (see the Error Reference)
- error_message: human-readable error description when available
- created_at: UTC timestamp
- request_tag: the first Wafer-Request-Tag header set on the inference call, or null if you didn’t tag the request. Single-tag callers can keep reading just this field.
- request_tags: the full ordered list of Wafer-Request-Tag header values set on the call (1–8 entries), or null if untagged. When you only sent one tag, this is a one-element list containing the same value as request_tag. See Request Tagging.
Returned only by GET /requests/{id} (the list endpoint omits these to keep paginated responses small; fetch the id from the list, then GET /{id} for the rest):
- output_tps: generation speed for this request, as opposed to the bin-aggregated TPS in Metrics.
- duration_ms: upstream/inference time. A large total_latency_ms − duration_ms gap indicates queueing or transport overhead.
- finish_reason and matched_stop_token_id: distinguish a natural stop from length / tool_calls / abrupt cutoff.
- stall_count: intra-stream pauses longer than 1s between tokens.
- path, method: the HTTP route actually invoked.
- temperature, top_p, tool_count: request-body characteristics. Use these to answer “is this slow request a tool-heavy / high-temp call?”
- transport_status_code, stream_error_status_code, stream_error_type: error / quality signals. Mostly null on happy-path 2xx requests. Useful when the stream broke mid-response.
- usage_present, usage_parse_error: false / non-null when wafer-edge couldn’t parse a usage payload back from the upstream.
Some fields (backend_id, winner_backend_id, node_name, inference_engine) are intentionally not exposed.
Pagination
If has_more is true, pass the returned cursor value into the next request:
cursor is a created_at ISO-8601 timestamp (with timezone); the API serves rows strictly older than the cursor.
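The cursor loop described above might look like the following sketch. `fetch_page` is a caller-supplied function that performs the GET and returns the decoded JSON body, and the `"requests"` key for the row list is an assumed name; only `has_more` and `cursor` are named on this page.

```python
def iter_requests(fetch_page, max_pages=100):
    """Yield rows across pages, following the cursor while has_more is true.

    fetch_page(cursor) must return the decoded JSON body for one page.
    The "requests" key is an assumption about the response shape.
    """
    cursor = None
    for _ in range(max_pages):  # hard cap guards against a runaway loop
        page = fetch_page(cursor)
        yield from page.get("requests", [])
        if not page.get("has_more"):
            return
        # Rows on the next page are strictly older than this timestamp.
        cursor = page["cursor"]


# Usage with canned pages standing in for real HTTP responses.
fake_pages = {
    None: {"requests": [{"request_id": "a"}], "has_more": True,
           "cursor": "2026-05-22T14:55:00+00:00"},
    "2026-05-22T14:55:00+00:00": {"requests": [{"request_id": "b"}],
                                  "has_more": False},
}
all_ids = [row["request_id"] for row in iter_requests(fake_pages.get)]
```

Injecting `fetch_page` keeps the loop independent of any particular HTTP library or authentication scheme.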
Error-Only Mode
Use errors_only=true to list only 4xx and 5xx requests. This path queries Axiom logs directly so it can surface error requests that never made it into the request table (e.g. JSON-parse failures, body-too-large rejections).
The tail-latency and cache filters above are not applied in errors_only mode.
Errors
- 401: missing or invalid API key
- 403: API key does not have access to the requested endpoint
- 404: request ID was not found for that endpoint
- 422: invalid cursor format, missing timezone on the cursor, or invalid request_id format (must be a UUID or 12-char hex x-request-id)
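To avoid a 422 on a malformed ID, a client can check the format before calling the single-request endpoint. This is a sketch under assumptions: `is_valid_request_id` is a hypothetical helper, case-insensitive hex matching is a guess, and the server's exact validation may differ.

```python
import re
import uuid

# Assumption: the short ID is exactly 12 hexadecimal characters.
SHORT_ID = re.compile(r"[0-9a-fA-F]{12}")


def is_valid_request_id(value):
    """Return True for a 12-char hex ID or a canonical hyphenated UUID."""
    if SHORT_ID.fullmatch(value):
        return True
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # uuid.UUID also accepts braces and unhyphenated hex; require the
    # canonical 8-4-4-4-12 form to stay conservative.
    return str(parsed) == value.lower()
```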
Request Tagging
Attach one or more short customer-defined labels to any inference request by setting Wafer-Request-Tag on the call to /v1/chat/completions, /v1/messages, or /v1/completions. Tags are recorded with the request and surface in the response above as request_tag (first tag) and request_tags (full list), plus the ?tag=<value> filter on this endpoint and the tag chip picker on the dedicated-endpoint dashboard.
A single tag — most customers:
Repeated headers and comma lists can be mixed in one request (Wafer-Request-Tag: prewarm,ecomm + Wafer-Request-Tag: healthcare); the deduped union is what gets stored.
Common tag values customers use:
- prewarm: cache-warming pings that don’t represent real user load.
- health-check: synthetic uptime probes from your own infra.
- eval / eval-run-<id>: offline quality evaluations you want to separate from production traffic.
- synthetic: anything your client treats as not-from-a-real-user.
- Vertical / tenant labels (ecomm, healthcare, tenant-acme): when one fleet serves multiple customer segments and you want to slice the dashboard by who’s driving load.
Header rules
- Format: ^[a-z0-9_-]{1,32}$ per value: lowercase ASCII, digits, dashes, underscores. 1–32 characters.
- Two equivalent ways to send multiple tags: repeat the header (Wafer-Request-Tag: prewarm + Wafer-Request-Tag: ecomm) or comma-separate in one header (Wafer-Request-Tag: prewarm,ecomm). Whitespace around commas is fine (prewarm, ecomm). The two shapes can be mixed in a single request.
- Up to 8 tags per request. Repeated values are deduped (sending prewarm twice on the same request, via repeated headers or a comma list, stores prewarm once); order is preserved.
- Empty parts (trailing comma, double comma) are rejected as malformed so a typo like prewarm,ecomm, doesn’t silently drop the third tag.
- Malformed values are rejected with 400 invalid_request_tag so silent drift between what you send and what Wafer stores is impossible. The error message identifies the offending value when multiple tags are sent.
- The header is dedicated-endpoint only. If your key is a Wafer Serverless key (used on pass.wafer.ai), the header is silently ignored: sending it from a generic client won’t fail your request, but it also won’t show up in the dashboard.
- Tags are per-request. Untagged calls store null for both request_tag and request_tags, and are excluded by any ?tag= filter.
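The header rules above can be mirrored client-side to catch malformed tags before sending. This is an illustrative sketch, not the server's implementation: `parse_request_tags` is a hypothetical helper, and applying the 8-tag limit after deduplication (and raising when it is exceeded) is an assumption.

```python
import re

TAG_RE = re.compile(r"[a-z0-9_-]{1,32}")
MAX_TAGS = 8


def parse_request_tags(header_values):
    """Normalize Wafer-Request-Tag header values into an ordered tag list.

    header_values holds one string per header occurrence. Returns None
    for untagged calls and raises ValueError on malformed input.
    """
    tags = []
    for raw in header_values:
        for part in raw.split(","):
            tag = part.strip()  # whitespace around commas is allowed
            # An empty part (trailing or double comma) fails the regex too.
            if not TAG_RE.fullmatch(tag):
                raise ValueError(f"invalid_request_tag: {tag!r}")
            if tag not in tags:  # dedupe while preserving order
                tags.append(tag)
    if len(tags) > MAX_TAGS:  # assumption: limit applies after dedup
        raise ValueError("more than 8 tags on one request")
    return tags or None
```

Under these assumptions, `parse_request_tags(["prewarm,ecomm", "healthcare"])` yields the deduped union of both header shapes in send order.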
Filtering by tag
Once you’re tagging traffic, use the tag query param on GET /v1/endpoints/requests:
- Up to 20 tag values per query.
- Each value is validated against the same regex; an invalid tag in the filter returns 422.
- The filter is an OR-match against both request_tag and request_tags: a request matches if any of its stored tags is in the filter set. So ?tag=ecomm returns both single-tag rows where request_tag = "ecomm" and multi-tag rows where "ecomm" appears anywhere in request_tags.
- To exclude a tag (e.g. “everything except prewarms”), the API doesn’t have a negative match yet; post-filter on request_tags (or request_tag for single-tag callers) client-side. If first-class exclude is useful for you, let us know.
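Since there is no negative match, exclusion has to happen on the client. A minimal sketch of that post-filter, assuming each row is a dict with the request_tag and request_tags fields described under Key Fields (`exclude_tags` is a hypothetical helper name):

```python
def exclude_tags(rows, excluded):
    """Drop rows that carry any of the excluded tags; keep untagged rows."""
    excluded = set(excluded)
    kept = []
    for row in rows:
        stored = set(row.get("request_tags") or [])
        if row.get("request_tag"):
            stored.add(row["request_tag"])
        if stored.isdisjoint(excluded):
            kept.append(row)
    return kept


rows = [
    {"request_id": "a", "request_tag": "prewarm", "request_tags": ["prewarm"]},
    {"request_id": "b", "request_tag": None, "request_tags": None},
    {"request_id": "c", "request_tag": "ecomm", "request_tags": ["ecomm", "prewarm"]},
    {"request_id": "d", "request_tag": "eval", "request_tags": ["eval"]},
]
non_prewarm = [r["request_id"] for r in exclude_tags(rows, ["prewarm"])]
```

Checking both fields covers multi-tag rows where the excluded value is not the first tag.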
Common Workflows
Tail-latency investigation
Start in Metrics and chase the spike down to a single bad request:
- GET /v1/endpoints/metrics?endpoint=<HOST>&range_minutes=60: confirm ttft_p95_ms (or ttft_p99_ms) is elevated.
- GET /v1/endpoints/requests?endpoint=<HOST>&min_ttft_ms=<p95_value>&limit=20: pull the rows behind the spike.
- GET /v1/endpoints/requests/<REQUEST_ID>?endpoint=<HOST>: inspect a single offender. The gap between total_latency_ms and duration_ms separates queueing/transport overhead from upstream slowness; stall_count > 0 means the stream paused mid-response; output_tps shows the per-request generation rate.
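The last step of that investigation reduces to a few field comparisons. A hedged sketch, assuming the single-request record has already been decoded into a dict (`latency_breakdown` is a hypothetical helper name):

```python
def latency_breakdown(record):
    """Split a single-request record into overhead and stall signals."""
    total = record.get("total_latency_ms")
    upstream = record.get("duration_ms")
    # Either field may be missing on older rows; report None in that case.
    overhead = None if total is None or upstream is None else total - upstream
    return {
        "overhead_ms": overhead,  # queueing / transport time
        "stalled": (record.get("stall_count") or 0) > 0,
        "output_tps": record.get("output_tps"),
    }


# Illustrative values only, not real API output.
example = {"total_latency_ms": 5200, "duration_ms": 4100,
           "stall_count": 2, "output_tps": 38.5}
summary = latency_breakdown(example)
```

A large overhead_ms with no stalls points at queueing or transport, while a small overhead with stalls points at the upstream stream itself.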
Low-cache-hit large requests
When cache_hit_pct looks healthy in aggregate but you want the specific large requests that missed cache (to figure out what’s varying in the prompt prefix: a datetime, a session ID, a per-user header), combine max_cache_hit_pct with min_input_tokens:
- GET /v1/endpoints/metrics?endpoint=<HOST>&range_minutes=360: confirm aggregate cache_hit_pct is healthy. The interesting requests are the tail dragging it down.
- GET /v1/endpoints/requests?endpoint=<HOST>&max_cache_hit_pct=5&min_input_tokens=20000&limit=50: surface large prompts that hit ≤5% cache. Tune the two thresholds to your traffic:
  - max_cache_hit_pct low (say 5 or 10) isolates the genuinely cache-missing requests.
  - min_input_tokens large (say 20000+) skips small requests: short chat turns, prewarm pings, and anything else that wouldn’t carry a reusable prefix anyway. This is a size proxy. If you want to label your own prewarm traffic explicitly so you can include or exclude it precisely, send a request tag on the inference call and filter with ?tag= here.
- For each offender, GET /requests/{id} and compare input_tokens vs cache_read_tokens. The difference (input_tokens − cache_read_tokens) is the fresh prompt content: the tokens that had to be re-encoded from scratch this turn. Large fresh against a prompt that should have a stable prefix usually points at:
  - A dynamic value injected near the top of the prompt (timestamps, request IDs, randomized example ordering).
  - A new system prompt or rotated context.
  - A request that fell off the cache eviction window.
input_tokens, cache_read_tokens, and the derived fresh = input_tokens − cache_read_tokens and hit_pct columns — so you can pull the offender list locally and diff prompts side-by-side.
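The two derived values can also be computed directly from any returned row. A small sketch, assuming rows are dicts with the token-count fields listed under Key Fields (`cache_stats` is a hypothetical helper name):

```python
def cache_stats(row):
    """Return (fresh_tokens, hit_pct) for one request row."""
    input_tokens = row.get("input_tokens") or 0
    cached = row.get("cache_read_tokens") or 0
    fresh = input_tokens - cached
    # Guard against division by zero on rows with no input tokens.
    hit_pct = 100 * cached / input_tokens if input_tokens else 0.0
    return fresh, hit_pct


# Illustrative values only: a 20k-token prompt with 500 cached tokens.
fresh, hit_pct = cache_stats({"input_tokens": 20000, "cache_read_tokens": 500})
```

Sorting offenders by fresh rather than by hit_pct puts the most expensive cache misses first.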
Filter by your own tags
If you’ve tagged your inference calls with Wafer-Request-Tag (see Request Tagging), use ?tag= to scope the request list:
- ?tag=prewarm&limit=20: only the requests your client labelled as prewarms.
- ?tag=prewarm&tag=health-check: multiple tags OR together (returns rows matching either).
- Combine with the other filters to slice your analysis: ?tag=prewarm&max_cache_hit_pct=5&min_input_tokens=20000 will surface large prewarm prompts that missed cache, in case your prewarm payloads need re-tuning.
To exclude a tag, omit tag= and instead post-filter client-side on the returned request_tags field; the API doesn’t expose a negative match yet. If that’s important to your workflow, let us know.
Streaming health check
is_streaming = true with ttft_ms = null on the list endpoint means the stream errored before the first token. Use errors_only=true to pull those rows, then check stream_error_status_code and stream_error_type on the single-request response to see where the upstream broke.
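The list-side check is a one-line filter. A sketch, assuming list rows are dicts with the is_streaming and ttft_ms fields from Key Fields (`errored_before_first_token` is a hypothetical helper name):

```python
def errored_before_first_token(rows):
    """Return streaming rows that never produced a first token."""
    return [
        row for row in rows
        if row.get("is_streaming") and row.get("ttft_ms") is None
    ]


# Illustrative rows only, not real API output.
sample = [
    {"request_id": "a", "is_streaming": True, "ttft_ms": None},
    {"request_id": "b", "is_streaming": True, "ttft_ms": 420},
    {"request_id": "c", "is_streaming": False, "ttft_ms": None},
]
broken = [row["request_id"] for row in errored_before_first_token(sample)]
```

Non-streaming rows are skipped because ttft_ms is always null for them.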
Notes
- Request lookup is endpoint-scoped. A valid request ID for another endpoint returns 404.
- Use x-request-id for exact correlation between your application logs and Wafer’s logs.
- GET /requests/{id} falls back from total_latency_ms to duration_ms for older rows that don’t have the edge-to-edge wall-time measurement.