Use the Serverless API directly when you are building against Wafer from your own application, scripts, or low-level tooling. For Claude Code, Codex, Cline, Roo Code, and other agent harnesses, use Agent Setup instead.
Base URL
| Surface | URL |
|---|---|
| OpenAI-compatible API | https://pass.wafer.ai/v1 |
| Anthropic-compatible Messages API | https://pass.wafer.ai/v1/messages |
Send your API key on every request:
Authorization: Bearer <YOUR_WAFER_API_KEY>
To require Zero Data Retention for a single request, add:
Wafer-ZDR: required
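As a minimal sketch of the header rules above, the helper below assembles the headers for one request. The function name `build_headers` and its `require_zdr` flag are hypothetical; only the `Authorization` and `Wafer-ZDR: required` headers come from this page.

```python
def build_headers(api_key: str, require_zdr: bool = False) -> dict[str, str]:
    """Return HTTP headers for a Wafer Serverless API request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if require_zdr:
        # Per-request opt-in: route only to ZDR-capable infrastructure.
        headers["Wafer-ZDR"] = "required"
    return headers
```

Keeping the ZDR switch as an explicit argument makes it harder to forget the header on the requests that need it.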
List Models
curl -sS "https://pass.wafer.ai/v1/models" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>"
The model list is the source of truth for currently available Serverless model IDs. Each card layers Wafer-specific capabilities and pricing on top of the standard OpenAI shape:
{
"object": "list",
"data": [
{
"id": "GLM-5.1",
"object": "model",
"created": 1779148800,
"owned_by": "wafer",
"max_model_len": 202752,
"zdr_supported": true,
"wafer": {
"display_name": "GLM-5.1",
"description": "General Language Model 5.1 — high-quality bilingual (EN/ZH) generation with strong coding and reasoning capabilities.",
"tier": "pass_included",
"context_length": 202752,
"capabilities": {
"vision": false,
"tools": true,
"reasoning": true
},
"pricing": {
"input_cents_per_million": 100,
"output_cents_per_million": 320,
"cache_read_cents_per_million": 10
}
}
}
]
}
id, object, created, owned_by are the stable OpenAI fields — SDKs that only read these keep working.
max_model_len is the hard context-window cap; requests past it return context_length_exceeded.
zdr_supported: true means the model accepts Wafer-ZDR: required. Models without ZDR support omit the field or set it to false.
wafer.capabilities.{vision, tools, reasoning} tell you which features the model supports; branch on these instead of hard-coding per-model logic.
wafer.pricing is in cents per million tokens and is the rate Wafer bills at; check it whenever pricing changes matter to your code path.
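The field notes above can be turned into small helpers that work on one entry of the `data` array. This is a sketch under assumptions: the helper names are hypothetical, and the cost estimate ignores cached-token pricing and any rounding Wafer applies when billing.

```python
def supports(model: dict, *features: str) -> bool:
    """True if every named capability (vision, tools, reasoning) is enabled."""
    caps = model.get("wafer", {}).get("capabilities", {})
    return all(caps.get(feature, False) for feature in features)


def zdr_capable(model: dict) -> bool:
    """A missing zdr_supported field is treated the same as false."""
    return model.get("zdr_supported", False) is True


def estimate_cost_cents(model: dict, prompt_tokens: int, completion_tokens: int) -> float:
    """Rough cost in cents from wafer.pricing (cache reads not modeled)."""
    pricing = model["wafer"]["pricing"]
    total = (
        prompt_tokens * pricing["input_cents_per_million"]
        + completion_tokens * pricing["output_cents_per_million"]
    )
    return total / 1_000_000
```

For example, filtering the list with `[m["id"] for m in payload["data"] if supports(m, "tools", "reasoning")]` picks models by capability rather than by a hard-coded name.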
Chat Completions
Use POST /v1/chat/completions for ordinary text prompts and OpenAI-compatible clients:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "GLM-5.1",
"messages": [
{"role": "user", "content": "Reply with the single word: ready."}
],
"max_tokens": 16,
"temperature": 0
}'
Add Wafer-ZDR: required when the request must only route to ZDR-capable infrastructure:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Wafer-ZDR: required" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B",
"messages": [{"role": "user", "content": "Summarize what Wafer does."}],
"max_tokens": 128
}'
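The same requests can be built from Python with only the standard library. The sketch below mirrors the two curl calls above; `build_chat_request` and `send` are hypothetical helper names, and the 60-second timeout is an arbitrary choice rather than a documented value.

```python
import json
import urllib.request

BASE_URL = "https://pass.wafer.ai/v1"


def build_chat_request(api_key, model, messages, *, require_zdr=False, **params):
    """Build (but do not send) a POST /v1/chat/completions request."""
    body = {"model": model, "messages": messages, **params}
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if require_zdr:
        headers["Wafer-ZDR"] = "required"
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

A call such as `send(build_chat_request(key, "GLM-5.1", [{"role": "user", "content": "hi"}], max_tokens=16, temperature=0))` corresponds to the first curl example.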
Reasoning Controls
Reasoning-capable models can return a separate reasoning_content field alongside the final answer. Discover support from GET /v1/models by checking wafer.capabilities.reasoning.
Wafer accepts three equivalent control shapes:
| Shape | Use |
|---|---|
| thinking: {"type": "enabled"} or thinking: {"type": "disabled"} | Recommended for simple on/off examples. |
| reasoning_effort set to none, low, medium, high, or max | Use when you want an explicit effort level. |
| enable_thinking: true or enable_thinking: false | Qwen/DashScope-compatible shape. |
Reasoning is off by default for every reasoning-capable model; it runs only when you explicitly enable it.
The same on/off curl shape works across reasoning-capable models. For example,
swap the model value to deepseek-v4-pro to run DeepSeek V4 Pro with the
same toggle. You can also use reasoning_effort (none, low, medium,
high, or max) when you want an explicit effort level.
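To make the three control shapes concrete, the sketch below returns the request-body fragment for each one. The function name and the `shape` argument are hypothetical. Treating `reasoning_effort: "none"` as the "off" setting is an assumption drawn from the statement that the shapes are equivalent.

```python
EFFORT_LEVELS = ("none", "low", "medium", "high", "max")


def reasoning_fields(enabled: bool, shape: str = "thinking", effort: str = "medium") -> dict:
    """Return the body fields that switch reasoning on or off in one shape."""
    if shape == "thinking":
        return {"thinking": {"type": "enabled" if enabled else "disabled"}}
    if shape == "enable_thinking":
        return {"enable_thinking": enabled}
    if shape == "reasoning_effort":
        # Assumption: "none" is the off switch for this shape.
        level = effort if enabled else "none"
        if level not in EFFORT_LEVELS or (enabled and level == "none"):
            raise ValueError(f"invalid reasoning effort: {effort!r}")
        return {"reasoning_effort": level}
    raise ValueError(f"unknown control shape: {shape!r}")
```

Merge the result into the request body, for example `{"model": "GLM-5.1", "messages": msgs, **reasoning_fields(True)}`.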
Where the reasoning text appears. Most reasoning-capable models (GLM-5.1, Kimi-K2.6, Qwen3.5-397B-A17B, Qwen3.6-35B-A3B, qwen3.6-max-preview, qwen3.7-max, deepseek-v4-flash, deepseek-v4-pro) return reasoning in a separate reasoning_content field on the assistant message.
MiniMax-M3 is an exception: it currently returns reasoning inline in content as <think>…</think> text rather than in a separate field. If you’re parsing reasoning programmatically, branch on the model, or strip the <think> block from content before displaying.
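One way to handle both layouts is sketched below. Instead of branching on the model name, it checks the shape of the message: it prefers `reasoning_content` when present and otherwise pulls `<think>` blocks out of `content`. The function name is hypothetical, and the sketch assumes any inline `<think>` block is properly closed.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) for an assistant message."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content")
    if reasoning:
        # Separate-field layout used by most reasoning-capable models.
        return reasoning, content
    # Inline layout: extract every <think> block, then remove it from the answer.
    blocks = _THINK_RE.findall(content)
    answer = _THINK_RE.sub("", content).strip()
    return "\n".join(block.strip() for block in blocks), answer
```

Messages with no reasoning at all come back with an empty reasoning string and the content unchanged apart from trimmed whitespace.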
Kimi-K2.6
With reasoning off:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "Kimi-K2.6",
"messages": [{"role": "user", "content": "Reply with exactly: ok"}],
"max_tokens": 64,
"thinking": {"type": "disabled"}
}'
With reasoning on:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "Kimi-K2.6",
"messages": [{"role": "user", "content": "Reply with exactly: ok"}],
"max_tokens": 256,
"thinking": {"type": "enabled"}
}'
GLM-5.1
With reasoning off:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "GLM-5.1",
"messages": [{"role": "user", "content": "Reply with exactly: ok"}],
"max_tokens": 64,
"thinking": {"type": "disabled"}
}'
With reasoning on:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "GLM-5.1",
"messages": [{"role": "user", "content": "Reply with exactly: ok"}],
"max_tokens": 256,
"thinking": {"type": "enabled"}
}'
Qwen3.5-397B-A17B
With reasoning off:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B",
"messages": [{"role": "user", "content": "Reply with exactly: ok"}],
"max_tokens": 64,
"thinking": {"type": "disabled"}
}'
With reasoning on:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-397B-A17B",
"messages": [{"role": "user", "content": "Reply with exactly: ok"}],
"max_tokens": 256,
"thinking": {"type": "enabled"}
}'
DeepSeek V4 Pro
With reasoning off:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v4-pro",
"messages": [{"role": "user", "content": "Reply with exactly: ok"}],
"max_tokens": 64,
"thinking": {"type": "disabled"}
}'
With reasoning on:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v4-pro",
"messages": [{"role": "user", "content": "Reply with exactly: ok"}],
"max_tokens": 256,
"thinking": {"type": "enabled"}
}'
Streaming
Set stream to true and add -N to receive server-sent events as they arrive:
curl -N -sS "https://pass.wafer.ai/v1/chat/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "GLM-5.1",
"messages": [{"role": "user", "content": "Write a one-sentence haiku."}],
"max_tokens": 64,
"temperature": 0.7,
"stream": true
}'
Usage chunks are always included on streaming requests. Wafer automatically
sets stream_options: {"include_usage": true, "continuous_usage_stats": true}
on every streaming chat completion so the final SSE chunk carries
usage.{prompt_tokens, completion_tokens, total_tokens}. You don’t need to send
stream_options yourself — and if you do, the auto-injected values still win.
This means you can reliably bill / track token spend from streaming responses
the same way you would from non-streaming.
Tool calls and streaming. When the model decides to call a tool, the full
tool_calls array arrives in a single chunk rather than streamed
argument-by-argument. Buffer the chunk before processing; partial tool-call
deltas will not occur on Wafer.
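The streaming behavior above can be consumed with a small parser like the sketch below. It assumes the OpenAI-style SSE framing (`data: {json}` lines ending with a `data: [DONE]` sentinel, and deltas under `choices[].delta`), which this page implies but does not spell out. The function name is hypothetical, and the input must already be decoded to text lines.

```python
import json


def consume_sse(lines):
    """Collect text, tool calls, and the latest usage from SSE lines."""
    text_parts = []
    tool_calls = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage"):
            usage = chunk["usage"]  # keep the most recent usage stats
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text_parts.append(delta["content"])
            if delta.get("tool_calls"):
                # Whole tool calls arrive in one chunk, so no merging of partial deltas.
                tool_calls.extend(delta["tool_calls"])
    return "".join(text_parts), tool_calls, usage
```

Because usage is overwritten on every chunk that carries it, the returned value reflects the final chunk's totals.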
Chat Request Body
| Field | Type | Required | Notes |
|---|---|---|---|
| model | string | Yes | Any Serverless model ID from GET /v1/models, such as GLM-5.1 or Qwen3.5-397B-A17B. |
| messages | array | Yes | OpenAI-compatible chat messages with role and content. |
| max_tokens | integer | No | Maximum generated tokens. Must be positive when provided. |
| temperature | number | No | Sampling temperature. Use 0 for deterministic decoding. |
| top_p | number | No | Nucleus sampling cutoff. |
| top_k | integer | No | Limits sampling to the top K candidate tokens on supported models. |
| min_p | number | No | Minimum probability threshold on supported models. |
| frequency_penalty | number | No | Penalizes repeated tokens by frequency. |
| presence_penalty | number | No | Penalizes tokens that have already appeared. |
| repetition_penalty | number | No | SGLang repetition penalty on supported models. |
| stop | string or array | No | Stop sequence or sequences. |
| stream | boolean | No | When true, returns streaming chat completion chunks. |
| tools | array | No | OpenAI-compatible tool definitions on models that support tool calling. |
| tool_choice | string or object | No | Controls tool selection for compatible models. |
| response_format | object | No | Use JSON mode or structured outputs on compatible models. |
| logprobs | boolean | No | Request token log probabilities on compatible models. |
| top_logprobs | integer | No | Number of log probabilities to include when logprobs is enabled. |
| thinking | object or boolean | No | Recommended reasoning on/off control, for example {"type": "enabled"} or {"type": "disabled"}. |
| reasoning_effort | string | No | Reasoning effort: none, low, medium, high, or max. |
| enable_thinking | boolean | No | Compatibility reasoning on/off control. |
| preserve_thinking | boolean | No | Wafer-shape for preserved reasoning across turns. Supported on Kimi-K2.6 and GLM-5.1. See Multi-turn preserved thinking. |
| seed | integer | No | Deterministic sampling seed on backends that support it. |
| n | integer | No | Number of completions to generate. Stripped on Kimi-K2.6 (see caveat below). |
| logit_bias | object | No | Token-level logit adjustments. Supported on SGLang backends. |
| parallel_tool_calls | boolean | No | Allow the model to emit multiple tool_calls in a single response. |
| stream_options | object | No | {"include_usage": true} to include token counts in the final streaming chunk. Wafer sets this automatically on every streaming request; see Streaming. |
Unsupported or model-specific parameters return a request error instead of being silently ignored — except where noted in Model-specific behavior.
Model-specific Behavior
A handful of routes intentionally diverge from the generic OpenAI/Anthropic contract. Know these before you ship.
Kimi-K2.6 sampling params are stripped
The non-ZDR Kimi-K2.6 route forwards to Moonshot’s hosted kimi-k2.6, which enforces fixed sampling values (temperature=1.0, top_p=0.95, n=1, presence_penalty=0, frequency_penalty=0) and rejects anything else. To keep behavior identical between the ZDR and non-ZDR Kimi backends, Wafer strips temperature, top_p, n, presence_penalty, and frequency_penalty from every Kimi-K2.6 request before forwarding.
If you send temperature: 0 to Kimi-K2.6, expect Moonshot-default sampling (temperature=1.0) at the model. Either pick a model where those controls take effect (GLM-5.1, Qwen3.5-397B-A17B, etc.), or compensate with prompt engineering / reasoning_effort.
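Because stripped parameters fail silently rather than raising an error, a client-side check can flag them before the request goes out. The sketch below is a hypothetical helper that mirrors the list of stripped fields given above; it is not part of any Wafer SDK.

```python
# Fields Wafer removes from every Kimi-K2.6 request, per the section above.
KIMI_STRIPPED = ("temperature", "top_p", "n", "presence_penalty", "frequency_penalty")


def stripped_params(payload: dict) -> list[str]:
    """Return the fields in this request body that will have no effect."""
    if payload.get("model") != "Kimi-K2.6":
        return []
    return [name for name in KIMI_STRIPPED if name in payload]
```

Logging a warning when the returned list is non-empty helps catch code paths that expect `temperature: 0` to produce deterministic output.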
MiniMax-M3 returns inline <think> reasoning
See the caveat in Reasoning Controls above. MiniMax-M3 does not populate reasoning_content; it inlines <think>…</think> in content instead.
Multi-turn preserved thinking
Kimi-K2.6 and GLM-5.1 accept either preserve_thinking: true (Wafer shape) or thinking: {"type": "enabled", "keep": "all"} (Moonshot shape) to carry prior turns’ reasoning back into the next turn’s context. The previous turn’s reasoning_content is inlined as <think>…</think> inside the assistant message before the chat template runs, so the model can build on its own earlier chain of thought.
{
"model": "Kimi-K2.6",
"thinking": {"type": "enabled"},
"preserve_thinking": true,
"messages": [
{"role": "user", "content": "Hard problem…"},
{
"role": "assistant",
"content": "…final answer from turn 1…",
"reasoning_content": "…chain of thought from turn 1…"
},
{"role": "user", "content": "Follow-up…"}
]
}
Default is off — reasoning is not preserved across turns unless you opt in.
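As a sketch of the opt-in flow, the helper below builds the next-turn request body from the previous assistant message, carrying its `reasoning_content` along. The function and argument names are hypothetical, and it uses the Wafer shape (`preserve_thinking: true`) rather than the Moonshot shape.

```python
def next_turn_payload(model: str, history: list, assistant_message: dict, follow_up: str) -> dict:
    """Build a follow-up request that preserves the prior turn's reasoning."""
    carried = {
        "role": "assistant",
        "content": assistant_message.get("content") or "",
    }
    if assistant_message.get("reasoning_content"):
        # Send the earlier chain of thought back so the model can build on it.
        carried["reasoning_content"] = assistant_message["reasoning_content"]
    return {
        "model": model,
        "thinking": {"type": "enabled"},
        "preserve_thinking": True,
        "messages": [*history, carried, {"role": "user", "content": follow_up}],
    }
```

Here `history` holds the messages sent in the earlier request and `assistant_message` is the message object returned for that turn.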
Text Completions
Use POST /v1/completions only when you need token-ID prompts or constrained decoding on a supported route:
curl -sS "https://pass.wafer.ai/v1/completions" \
-H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
-H "Content-Type: application/json" \
-d '{
"model": "GLM-5.1",
"prompt": [9703],
"max_tokens": 2,
"temperature": 0,
"ebnf": "root ::= \"A\" | \"B\""
}'
For the full /v1/completions request shape, streaming example, parameter table, and response shape, see Tokenized Completions and Constrained Decoding.
Anthropic Messages
Wafer also exposes an Anthropic-compatible Messages endpoint at https://pass.wafer.ai/v1/messages. Most users reach it through Claude Code or Conductor; see Agent Setup for the required environment variables.