Use the Serverless API directly when you are building against Wafer from your own application, scripts, or low-level tooling. For Claude Code, Codex, Cline, Roo Code, and other agent harnesses, use Agent Setup instead.

Base URL

Surface | URL
OpenAI-compatible API | https://pass.wafer.ai/v1
Anthropic-compatible Messages API | https://pass.wafer.ai/v1/messages
Send your API key on every request:
Authorization: Bearer <YOUR_WAFER_API_KEY>
To require Zero Data Retention for a single request, add:
Wafer-ZDR: required
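As a minimal sketch, the two headers above can be assembled in Python like this. The helper name build_headers and its zdr flag are illustrative assumptions, not part of any Wafer SDK.

```python
def build_headers(api_key: str, zdr: bool = False) -> dict[str, str]:
    """Return HTTP headers for a Wafer Serverless API request.

    build_headers is a hypothetical helper; only the header names and
    values come from the documentation above.
    """
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if zdr:
        # Per-request opt-in to Zero Data Retention routing.
        headers["Wafer-ZDR"] = "required"
    return headers
```

Keeping the ZDR switch as a per-call argument mirrors the fact that the header applies to a single request rather than to the whole key.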

List Models

curl -sS "https://pass.wafer.ai/v1/models" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>"
The model list is the source of truth for currently available Serverless model IDs. Each card layers Wafer-specific capabilities and pricing on top of the standard OpenAI shape:
{
  "object": "list",
  "data": [
    {
      "id": "GLM-5.1",
      "object": "model",
      "created": 1779148800,
      "owned_by": "wafer",
      "max_model_len": 202752,
      "zdr_supported": true,
      "wafer": {
        "display_name": "GLM-5.1",
        "description": "General Language Model 5.1 — high-quality bilingual (EN/ZH) generation with strong coding and reasoning capabilities.",
        "tier": "pass_included",
        "context_length": 202752,
        "capabilities": {
          "vision": false,
          "tools": true,
          "reasoning": true
        },
        "pricing": {
          "input_cents_per_million": 100,
          "output_cents_per_million": 320,
          "cache_read_cents_per_million": 10
        }
      }
    }
  ]
}
  • id, object, created, owned_by are the stable OpenAI fields — SDKs that only read these keep working.
  • max_model_len is the hard context-window cap; requests past it return context_length_exceeded.
  • zdr_supported: true means the model accepts Wafer-ZDR: required. Models without ZDR support omit the field or set it false.
  • wafer.capabilities.{vision, tools, reasoning} tell you what features the model actually supports — branch on these instead of hard-coding per-model logic.
  • wafer.pricing is in cents per million tokens and is the rate Wafer bills at; if your code depends on cost, read it from the model list rather than hard-coding prices.
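The fields described above lend themselves to a small client-side filter. The following Python sketch assumes you have already parsed the GET /v1/models response into a dict; select_models and estimate_cost_cents are hypothetical helper names, and the cost estimate deliberately ignores cache reads.

```python
from typing import Any

# Trimmed copy of the sample response shown above.
LISTING: dict[str, Any] = {
    "object": "list",
    "data": [{
        "id": "GLM-5.1",
        "zdr_supported": True,
        "wafer": {
            "capabilities": {"vision": False, "tools": True, "reasoning": True},
            "pricing": {
                "input_cents_per_million": 100,
                "output_cents_per_million": 320,
                "cache_read_cents_per_million": 10,
            },
        },
    }],
}

def select_models(listing: dict[str, Any], *, tools: bool = False,
                  vision: bool = False, reasoning: bool = False,
                  zdr: bool = False) -> list[str]:
    """Return IDs of models that satisfy every requested feature."""
    wanted = {"tools": tools, "vision": vision, "reasoning": reasoning}
    ids = []
    for card in listing.get("data", []):
        caps = card.get("wafer", {}).get("capabilities", {})
        if any(need and not caps.get(name, False) for name, need in wanted.items()):
            continue
        # A missing zdr_supported field is treated the same as false.
        if zdr and not card.get("zdr_supported", False):
            continue
        ids.append(card["id"])
    return ids

def estimate_cost_cents(card: dict[str, Any], prompt_tokens: int,
                        completion_tokens: int) -> float:
    """Rough cost in cents; assumes no prompt tokens are served from cache."""
    pricing = card["wafer"]["pricing"]
    total = (prompt_tokens * pricing["input_cents_per_million"]
             + completion_tokens * pricing["output_cents_per_million"])
    return total / 1_000_000
```

For example, select_models(LISTING, tools=True, zdr=True) returns the IDs that can serve a ZDR tool-calling request, which avoids hard-coding per-model logic.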

Chat Completions

Use POST /v1/chat/completions for ordinary text prompts and OpenAI-compatible clients:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [
      {"role": "user", "content": "Reply with the single word: ready."}
    ],
    "max_tokens": 16,
    "temperature": 0
  }'
Add Wafer-ZDR: required when the request must only route to ZDR-capable infrastructure:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Wafer-ZDR: required" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Summarize what Wafer does."}],
    "max_tokens": 128
  }'
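The same request can be issued from Python with only the standard library. This is a sketch under assumptions: build_chat_request and chat are hypothetical helpers, and the response is assumed to follow the usual OpenAI chat completion shape (choices[0].message.content).

```python
import json
import urllib.request

BASE_URL = "https://pass.wafer.ai/v1"

def build_chat_request(api_key: str, model: str, prompt: str, *,
                       max_tokens: int = 128,
                       zdr: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if zdr:
        # Route only to ZDR-capable infrastructure for this request.
        headers["Wafer-ZDR"] = "required"
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

def chat(api_key: str, model: str, prompt: str, **kwargs) -> str:
    """Send the request and return the assistant text (assumed OpenAI shape)."""
    req = build_chat_request(api_key, model, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything goes over the network.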

Reasoning Controls

Reasoning-capable models can return a separate reasoning_content field alongside the final answer. Discover support from GET /v1/models by checking wafer.capabilities.reasoning. Wafer accepts three equivalent control shapes:
Shape | Use
thinking: {"type": "enabled"} or thinking: {"type": "disabled"} | Recommended for simple on/off examples.
reasoning_effort set to none, low, medium, high, or max | Use when you want an explicit effort level.
enable_thinking: true or enable_thinking: false | Qwen/DashScope-compatible shape.
Reasoning is off by default for every reasoning-capable model; you must enable it explicitly. The same on/off curl shape works across reasoning-capable models. For example, swap the model value to deepseek-v4-pro to run DeepSeek V4 Pro with the same toggle. To set an explicit effort level instead, use reasoning_effort (none, low, medium, high, or max).
Where the reasoning text appears. Most reasoning-capable models (GLM-5.1, Kimi-K2.6, Qwen3.5-397B-A17B, Qwen3.6-35B-A3B, qwen3.6-max-preview, qwen3.7-max, deepseek-v4-flash, deepseek-v4-pro) return reasoning in a separate reasoning_content field on the assistant message. MiniMax-M3 is an exception: it currently returns reasoning inline in content as <think>…</think> text rather than in a separate field. If you’re parsing reasoning programmatically, branch on the model, or strip the <think> block from content before displaying.
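One way to handle both layouts is to prefer the separate field and fall back to the inline tags. The Python sketch below assumes the assistant message has been parsed into a dict; split_reasoning is a hypothetical helper, and it does not attempt to handle an unclosed <think> tag.

```python
import re

# Non-greedy match so several <think> blocks are captured separately.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) for an assistant message.

    Uses reasoning_content when present; otherwise extracts inline
    <think>...</think> blocks from content and removes them from the answer.
    """
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content")
    if reasoning:
        return reasoning, content
    blocks = [block.strip() for block in _THINK_RE.findall(content)]
    answer = _THINK_RE.sub("", content).strip()
    return "\n".join(blocks), answer
```

Because the function keys off the message shape rather than the model name, it keeps working if a model changes where it puts its reasoning.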

Kimi-K2.6

With reasoning off:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2.6",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 64,
    "thinking": {"type": "disabled"}
  }'
With reasoning on:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2.6",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 256,
    "thinking": {"type": "enabled"}
  }'

GLM-5.1

With reasoning off:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 64,
    "thinking": {"type": "disabled"}
  }'
With reasoning on:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 256,
    "thinking": {"type": "enabled"}
  }'

Qwen3.5-397B-A17B

With reasoning off:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 64,
    "thinking": {"type": "disabled"}
  }'
With reasoning on:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 256,
    "thinking": {"type": "enabled"}
  }'

DeepSeek V4 Pro

With reasoning off:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 64,
    "thinking": {"type": "disabled"}
  }'
With reasoning on:
curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 256,
    "thinking": {"type": "enabled"}
  }'

Streaming

Set stream to true and add -N to receive server-sent events as they arrive:
curl -N -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [{"role": "user", "content": "Write a one-sentence haiku."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "stream": true
  }'
Usage chunks are always included on streaming requests. Wafer automatically sets stream_options: {"include_usage": true, "continuous_usage_stats": true} on every streaming chat completion, so the final SSE chunk carries usage.{prompt_tokens, completion_tokens, total_tokens}. You don’t need to send stream_options yourself, and if you do, the auto-injected values still win. You can therefore track or bill token spend from streaming responses the same way you would from non-streaming ones.
Tool calls and streaming. When the model decides to call a tool, the full tool_calls array arrives in a single chunk rather than streamed argument by argument. Process that chunk as a whole; partial tool-call deltas do not occur on Wafer.
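A streaming consumer built on the behavior described above might look like the following Python sketch. It assumes the stream uses the usual OpenAI framing ("data: {json}" lines ending with "data: [DONE]"), which this page does not spell out; consume_stream is a hypothetical helper that takes already-decoded text lines.

```python
import json
from typing import Iterable

def consume_stream(lines: Iterable[str]) -> dict:
    """Collect text, tool calls, and usage from SSE lines.

    Assumes OpenAI-style framing: 'data: {json}' per event and a final
    'data: [DONE]' sentinel.
    """
    text_parts: list[str] = []
    tool_calls: list[dict] = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage"):
            # Keep overwriting so the last reported usage wins.
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
            if delta.get("tool_calls"):
                # The whole array arrives in one chunk, so no merging is needed.
                tool_calls.extend(delta["tool_calls"])
    return {"text": "".join(text_parts), "tool_calls": tool_calls, "usage": usage}
```

Since usage may appear on more than one chunk, the sketch simply keeps the most recent value and reports it once the stream ends.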

Chat Request Body

Field | Type | Required | Notes
model | string | Yes | Any Serverless model ID from GET /v1/models, such as GLM-5.1 or Qwen3.5-397B-A17B.
messages | array | Yes | OpenAI-compatible chat messages with role and content.
max_tokens | integer | No | Maximum generated tokens. Must be positive when provided.
temperature | number | No | Sampling temperature. Use 0 for deterministic decoding.
top_p | number | No | Nucleus sampling cutoff.
top_k | integer | No | Limits sampling to the top K candidate tokens on supported models.
min_p | number | No | Minimum probability threshold on supported models.
frequency_penalty | number | No | Penalizes repeated tokens by frequency.
presence_penalty | number | No | Penalizes tokens that have already appeared.
repetition_penalty | number | No | SGLang repetition penalty on supported models.
stop | string or array | No | Stop sequence or sequences.
stream | boolean | No | When true, returns streaming chat completion chunks.
tools | array | No | OpenAI-compatible tool definitions on models that support tool calling.
tool_choice | string or object | No | Controls tool selection for compatible models.
response_format | object | No | Use JSON mode or structured outputs on compatible models.
logprobs | boolean | No | Request token log probabilities on compatible models.
top_logprobs | integer | No | Number of log probabilities to include when logprobs is enabled.
thinking | object or boolean | No | Recommended reasoning on/off control, for example {"type": "enabled"} or {"type": "disabled"}.
reasoning_effort | string | No | Reasoning effort: none, low, medium, high, or max.
enable_thinking | boolean | No | Compatibility reasoning on/off control.
preserve_thinking | boolean | No | Wafer-shape for preserved reasoning across turns. Supported on Kimi-K2.6 and GLM-5.1. See Multi-turn preserved thinking.
seed | integer | No | Deterministic sampling seed on backends that support it.
n | integer | No | Number of completions to generate. Stripped on Kimi-K2.6 (see caveat below).
logit_bias | object | No | Token-level logit adjustments. Supported on sglang backends.
parallel_tool_calls | boolean | No | Allow the model to emit multiple tool_calls in a single response.
stream_options | object | No | {"include_usage": true} to include token counts in the final streaming chunk. Wafer sets this automatically on every streaming request; see Streaming.
Parameters that are unsupported, or that the target model does not accept, return a request error instead of being silently ignored, except where noted in Model-specific Behavior.
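Because bad parameters come back as request errors, a cheap local check can catch the most common mistakes before a request is sent. The Python sketch below covers only constraints stated in the table (required fields, positive max_tokens, allowed reasoning_effort values); validate_chat_body is a hypothetical helper, and the server remains the authority on what is accepted.

```python
REASONING_EFFORTS = {"none", "low", "medium", "high", "max"}

def validate_chat_body(body: dict) -> list[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems: list[str] = []

    model = body.get("model")
    if not isinstance(model, str) or not model:
        problems.append("model is required and must be a non-empty string")

    if not isinstance(body.get("messages"), list):
        problems.append("messages is required and must be an array")

    max_tokens = body.get("max_tokens")
    if max_tokens is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(max_tokens, int) and not isinstance(max_tokens, bool)
        if not is_int or max_tokens <= 0:
            problems.append("max_tokens must be a positive integer")

    effort = body.get("reasoning_effort")
    if effort is not None:
        if not isinstance(effort, str) or effort not in REASONING_EFFORTS:
            problems.append(
                "reasoning_effort must be one of: "
                + ", ".join(sorted(REASONING_EFFORTS))
            )
    return problems
```

Returning a list instead of raising on the first issue lets a caller surface every problem in one pass.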

Model-specific Behavior

A handful of routes intentionally diverge from the generic OpenAI/Anthropic contract. Know these before you ship.

Kimi-K2.6 sampling params are stripped

The non-ZDR Kimi-K2.6 route forwards to Moonshot’s hosted kimi-k2.6, which enforces fixed sampling values (temperature=1.0, top_p=0.95, n=1, presence_penalty=0, frequency_penalty=0) and rejects anything else. To keep behavior identical between the ZDR and non-ZDR Kimi backends, Wafer strips temperature, top_p, n, presence_penalty, and frequency_penalty from every Kimi-K2.6 request before forwarding. If you send temperature: 0 to Kimi-K2.6, expect Moonshot-default sampling (temperature=1.0) at the model. Either pick a model where those controls take effect (GLM-5.1, Qwen3.5-397B-A17B, etc.), or compensate with prompt engineering / reasoning_effort.
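Since the stripping happens silently, it can help to flag affected parameters on the client side. This Python sketch uses only the parameter list given above; stripped_params is a hypothetical helper, and the model ID comparison is an exact string match.

```python
# Parameters the paragraph above says are removed from Kimi-K2.6 requests.
KIMI_STRIPPED = ("temperature", "top_p", "n", "presence_penalty", "frequency_penalty")

def stripped_params(body: dict) -> list[str]:
    """Return the request fields that will have no effect on Kimi-K2.6."""
    if body.get("model") != "Kimi-K2.6":
        return []
    return [name for name in KIMI_STRIPPED if name in body]
```

A caller could log a warning whenever the returned list is non-empty, for example when temperature: 0 is sent to Kimi-K2.6 in the hope of deterministic output.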

MiniMax-M3 returns inline <think> reasoning

See the caveat in Reasoning Controls above. MiniMax-M3 does not populate reasoning_content; it inlines <think>…</think> in content instead.

Multi-turn preserved thinking

Kimi-K2.6 and GLM-5.1 accept either preserve_thinking: true (Wafer shape) or thinking: {"type": "enabled", "keep": "all"} (Moonshot shape) to carry prior turns’ reasoning back into the next turn’s context. The previous turn’s reasoning_content is inlined as <think>…</think> inside the assistant message before the chat template runs, so the model can build on its own earlier chain of thought.
{
  "model": "Kimi-K2.6",
  "thinking": {"type": "enabled"},
  "preserve_thinking": true,
  "messages": [
    {"role": "user", "content": "Hard problem…"},
    {
      "role": "assistant",
      "content": "…final answer from turn 1…",
      "reasoning_content": "…chain of thought from turn 1…"
    },
    {"role": "user", "content": "Follow-up…"}
  ]
}
Default is off — reasoning is not preserved across turns unless you opt in.
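Building the follow-up request amounts to echoing the previous assistant message, including its reasoning_content, back into messages. A minimal Python sketch, assuming the prior response's assistant message is available as a dict; next_turn_body is a hypothetical helper.

```python
def next_turn_body(model: str, history: list[dict],
                   assistant_message: dict, follow_up: str) -> dict:
    """Build a request body that carries the last turn's reasoning forward."""
    carried = {
        "role": "assistant",
        "content": assistant_message.get("content") or "",
    }
    reasoning = assistant_message.get("reasoning_content")
    if reasoning:
        # Without this field there is nothing for preserve_thinking to inline.
        carried["reasoning_content"] = reasoning
    return {
        "model": model,
        "thinking": {"type": "enabled"},
        "preserve_thinking": True,
        "messages": [*history, carried, {"role": "user", "content": follow_up}],
    }
```

The helper copies history into a new list, so the caller's conversation state is not mutated while the next request is assembled.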

Text Completions

Use POST /v1/completions only when you need token-ID prompts or constrained decoding on a supported route:
curl -sS "https://pass.wafer.ai/v1/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 2,
    "temperature": 0,
    "ebnf": "root ::= \"A\" | \"B\""
  }'
For the full /v1/completions request shape, streaming example, parameter table, and response shape, see Tokenized Completions and Constrained Decoding.
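When the request body is built in code, letting a JSON serializer handle the quotes inside the grammar avoids the manual escaping seen in the curl example. A short Python sketch, using only the fields shown above; build_completion_body is a hypothetical helper.

```python
import json
from typing import Optional, Sequence

def build_completion_body(model: str, token_ids: Sequence[int], *,
                          max_tokens: int, ebnf: Optional[str] = None) -> str:
    """Serialize a /v1/completions body with a token-ID prompt."""
    ids = list(token_ids)
    if not ids or not all(isinstance(t, int) and not isinstance(t, bool) for t in ids):
        raise ValueError("token_ids must be a non-empty sequence of integers")
    body = {
        "model": model,
        "prompt": ids,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    if ebnf is not None:
        # json.dumps escapes the embedded double quotes for us.
        body["ebnf"] = ebnf
    return json.dumps(body)
```

Calling build_completion_body("GLM-5.1", [9703], max_tokens=2, ebnf='root ::= "A" | "B"') produces the same JSON body as the curl example, with the grammar written in its natural form.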

Anthropic Messages

Wafer also exposes an Anthropic-compatible Messages endpoint at https://pass.wafer.ai/v1/messages. Most users reach it through Claude Code or Conductor; see Agent Setup for the required environment variables.