API Reference

Use the Serverless API directly when you are building against Wafer from your own application, scripts, or low-level tooling. For Claude Code, Codex, Cline, Roo Code, and other agent harnesses, use Agent Setup instead.

Base URL

Surface	URL
OpenAI-compatible API	`https://pass.wafer.ai/v1`
Anthropic-compatible Messages API	`https://pass.wafer.ai/v1/messages`

Send your API key on every request:

Authorization: Bearer <YOUR_WAFER_API_KEY>

To require Zero Data Retention for a single request, add:

Wafer-ZDR: required
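As a minimal Python sketch of the two headers above: the helper name `build_headers` and its `zdr` flag are hypothetical and not part of any Wafer SDK. It only assembles the header dictionary, so it can be passed to whichever HTTP client you already use.

```python
def build_headers(api_key: str, zdr: bool = False) -> dict[str, str]:
    """Return the HTTP headers for a Wafer request.

    build_headers is an illustrative helper, not an official API.
    """
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if zdr:
        # Require Zero Data Retention for this single request.
        headers["Wafer-ZDR"] = "required"
    return headers
```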

List Models

curl -sS "https://pass.wafer.ai/v1/models" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>"

The model list is the source of truth for currently available Serverless model IDs. Each card layers Wafer-specific capabilities and pricing on top of the standard OpenAI shape:

{
  "object": "list",
  "data": [
    {
      "id": "GLM-5.1",
      "object": "model",
      "created": 1779148800,
      "owned_by": "wafer",
      "max_model_len": 202752,
      "zdr_supported": true,
      "wafer": {
        "display_name": "GLM-5.1",
        "description": "General Language Model 5.1 — high-quality bilingual (EN/ZH) generation with strong coding and reasoning capabilities.",
        "tier": "pass_included",
        "context_length": 202752,
        "capabilities": {
          "vision": false,
          "tools": true,
          "reasoning": true,
          "chat_completions": {
            "supported": true,
            "streaming": true,
            "tools": true,
            "tool_streaming": true,
            "json_schema": true,
            "json_schema_refs": true,
            "grammar": true,
            "regex": true,
            "tools_with_response_format": true,
            "n": true
          },
          "messages": {
            "supported": true,
            "streaming": true,
            "tools": true
          },
          "responses": {
            "supported": true,
            "streaming": true,
            "text_format": ["text", "json_object", "json_schema"],
            "raw_json_schema_text": true
          },
          "zdr": {
            "supported": true,
            "same_capabilities": true
          }
        },
        "pricing": {
          "input_cents_per_million": 100,
          "output_cents_per_million": 320,
          "cache_read_cents_per_million": 10
        }
      }
    }
  ]
}

id, object, created, owned_by are the stable OpenAI fields — SDKs that only read these keep working.
max_model_len is the hard context-window cap; requests past it return context_length_exceeded.
zdr_supported: true means the model accepts Wafer-ZDR: required. Models without ZDR support omit the field or set it to false.
wafer.capabilities.{vision, tools, reasoning} are the legacy summary flags. Newer model cards also include per-surface flags under chat_completions, messages, responses, and zdr; branch on those when using structured outputs, grammar, regex, tools, or ZDR-specific behavior.
wafer.pricing is in cents per million tokens and is what we’ll bill at; check it whenever pricing changes matter to your code path.
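One way to read these fields from a parsed model card is sketched below. The helper names `supports` and `estimate_cost_cents` are hypothetical. The fallback to the legacy summary flags when a per-surface block is missing is an assumption about how older cards should be handled, and the cost estimate ignores cache-read pricing.

```python
def supports(card: dict, surface: str, feature: str) -> bool:
    """True if a model card advertises a boolean `feature` on `surface`.

    Falls back to the legacy summary flags (vision, tools, reasoning)
    when the card has no per-surface block for `surface`.
    """
    caps = card.get("wafer", {}).get("capabilities", {})
    per_surface = caps.get(surface)
    if isinstance(per_surface, dict):
        return per_surface.get(feature) is True
    return caps.get(feature) is True


def estimate_cost_cents(card: dict, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost in cents from wafer.pricing (cents per million tokens)."""
    pricing = card["wafer"]["pricing"]
    total = (
        prompt_tokens * pricing["input_cents_per_million"]
        + completion_tokens * pricing["output_cents_per_million"]
    )
    return total / 1_000_000
```

With the GLM-5.1 pricing shown above (100 and 320 cents per million), 1,000,000 prompt tokens plus 500,000 completion tokens comes to 260 cents.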

Compatibility Notes

pass.wafer.ai validates and normalizes a few model-specific features before a request is sent upstream:

Safe local JSON Schema references (#/$defs/... and #/definitions/...) are automatically inlined for tool schemas and structured outputs. Remote, unresolved, and recursive refs are rejected.
response_format.type = "grammar" is supported only on models whose wafer.capabilities.chat_completions.grammar flag is true.
Top-level regex is rejected when the selected model or ZDR partition would ignore it. For example, Kimi-K2.6 supports regex on the ZDR self-hosted partition but not on the non-ZDR Moonshot partition.
When tools and response_format are both present, tools keep OpenAI-style precedence so a tool-selected request can still return tool calls.
n > 1 is passed through only on models whose wafer.capabilities.chat_completions.n flag is true. Unsupported models fail fast with unsupported_feature and param: "n" instead of silently returning one choice.
OpenAI-compatible role: "tool" messages may send content: null; Wafer normalizes that to an empty tool result before dispatch so common SDK histories keep working.
/v1/responses with text.format.type = "json_schema" returns raw JSON text instead of wrapping JSON in Markdown fences.
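The flag checks in the notes above can be mirrored client-side before a request is sent. The following is a sketch only: `unsupported_features` is a hypothetical helper, and it looks solely at the chat_completions flags on the model card. It cannot see partition-specific behavior, such as the Kimi-K2.6 regex difference, unless the card reflects it.

```python
def unsupported_features(card: dict, body: dict) -> list[str]:
    """List request features the model card does not advertise for chat completions."""
    flags = card.get("wafer", {}).get("capabilities", {}).get("chat_completions", {})
    problems = []
    response_format = body.get("response_format") or {}
    if response_format.get("type") == "grammar" and not flags.get("grammar"):
        problems.append("response_format.type=grammar")
    if "regex" in body and not flags.get("regex"):
        problems.append("regex")
    # Treat a missing or null n as the default of one choice.
    if (body.get("n") or 1) > 1 and not flags.get("n"):
        problems.append("n")
    return problems
```

An empty list means the card advertises everything the request uses; the server remains the final authority.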

Chat Completions

Use POST /v1/chat/completions for ordinary text prompts and OpenAI-compatible clients:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [
      {"role": "user", "content": "Reply with the single word: ready."}
    ],
    "max_tokens": 16,
    "temperature": 0
  }'

Add Wafer-ZDR: required when the request must only route to ZDR-capable infrastructure:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Wafer-ZDR: required" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.2",
    "messages": [{"role": "user", "content": "Summarize what Wafer does."}],
    "max_tokens": 128
  }'

Reasoning Controls

Reasoning-capable models can return a separate reasoning_content field alongside the final answer. Discover support from GET /v1/models by checking wafer.capabilities.reasoning. Wafer accepts three equivalent control shapes:

Shape	Use
`thinking: {"type": "enabled"}` or `thinking: {"type": "disabled"}`	Recommended for simple on/off examples.
`reasoning_effort` set to `none`, `low`, `medium`, `high`, or `max`	Use when you want an explicit effort level.
`enable_thinking: true` or `enable_thinking: false`	Qwen/DashScope-compatible shape.

Reasoning is off by default on every model that supports turning it off; enable it explicitly when you want it. The exception is Kimi-K2.7-Code, where reasoning is always on. The same on/off curl shape works across reasoning-capable models. For example, swap the model value to GLM-5.2 to run the 1M-context GLM route with the same toggle.
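The three control shapes can be generated from one on/off decision, as in the sketch below. `reasoning_fields` is a hypothetical helper, and the default effort of `medium` when reasoning is enabled through `reasoning_effort` is an arbitrary choice for illustration, not a documented Wafer default.

```python
EFFORT_LEVELS = ("low", "medium", "high", "max")


def reasoning_fields(enabled: bool, shape: str = "thinking", effort: str = "medium") -> dict:
    """Return the request-body fields for one of the three reasoning control shapes."""
    if shape == "thinking":
        return {"thinking": {"type": "enabled" if enabled else "disabled"}}
    if shape == "reasoning_effort":
        if enabled and effort not in EFFORT_LEVELS:
            raise ValueError(f"unknown effort level: {effort!r}")
        # "none" is the off value for the effort shape.
        return {"reasoning_effort": effort if enabled else "none"}
    if shape == "enable_thinking":
        return {"enable_thinking": bool(enabled)}
    raise ValueError(f"unknown shape: {shape!r}")
```

Merge the returned dictionary into the request body, for example `{**body, **reasoning_fields(True)}`.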

Where the reasoning text appears. Most reasoning-capable models (GLM-5.1, GLM-5.2, glm5.2-fast, Kimi-K2.6, Kimi-K2.7-Code, Qwen3.6-35B-A3B, and qwen3.7-max) return reasoning in a separate reasoning_content field on the assistant message. MiniMax-M3 is an exception: it currently returns reasoning inline in content as <think>…</think> text rather than in a separate field. If you’re parsing reasoning programmatically, branch on the model, or strip the <think> block from content before displaying it.
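A small helper can hide the difference between the two shapes. The sketch below uses a hypothetical `split_reasoning` function that prefers reasoning_content when it is present and otherwise extracts inline <think> blocks from content. It assumes every <think> block is closed; a response truncated mid-reasoning would need extra handling.

```python
import re

# Matches one inline reasoning block plus any whitespace that follows it.
_THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) for an assistant message."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content")
    if reasoning:
        # Separate-field shape: content is already the final answer.
        return reasoning, content
    # Inline shape: collect each <think> block, then remove them from the answer.
    blocks = _THINK_RE.findall(content)
    answer = _THINK_RE.sub("", content).strip()
    return "\n".join(block.strip() for block in blocks), answer
```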

Kimi-K2.6

With reasoning off:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2.6",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 64,
    "thinking": {"type": "disabled"}
  }'

With reasoning on:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2.6",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 256,
    "thinking": {"type": "enabled"}
  }'

Kimi-K2.7-Code

Kimi-K2.7-Code is a coding-focused model with reasoning always on — there is no reasoning-off mode. You don’t need to send thinking/reasoning_effort; any attempt to disable thinking is treated as enabled. Give it room for the reasoning pass with a generous max_tokens.

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Kimi-K2.7-Code",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 512
  }'

Tool use must be model-decided. Because thinking is always on, Kimi-K2.7-Code rejects forced tool calls — tool_choice: "required" and a specific {"type": "function", ...} choice both return 400 with code: "unsupported_feature" and param: "tool_choice". Pass your tools with tool_choice: "auto" (or "none") and let the model decide.
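If your application sets tool_choice generically across models, a guard like the following avoids the 400. This is a sketch: `safe_tool_choice` and the `ALWAYS_REASONING_MODELS` set are hypothetical names, and the set holds only the one model this page documents as always-on.

```python
# Assumption: maintained by hand from the documentation, not fetched from the API.
ALWAYS_REASONING_MODELS = {"Kimi-K2.7-Code"}


def safe_tool_choice(model: str, tool_choice):
    """Downgrade forced tool choices to "auto" for always-reasoning models."""
    if model not in ALWAYS_REASONING_MODELS:
        return tool_choice
    if tool_choice in (None, "auto", "none"):
        return tool_choice
    # "required" or a specific function choice would be rejected, so let the model decide.
    return "auto"
```

Downgrading silently changes behavior, so you may prefer to raise an error in your own code instead.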

GLM-5.1

With reasoning off:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 64,
    "thinking": {"type": "disabled"}
  }'

With reasoning on:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 256,
    "thinking": {"type": "enabled"}
  }'

Qwen3.5-397B-A17B

With reasoning off:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 64,
    "thinking": {"type": "disabled"}
  }'

With reasoning on:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 256,
    "thinking": {"type": "enabled"}
  }'

GLM-5.2

With reasoning off:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.2",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 64,
    "thinking": {"type": "disabled"}
  }'

With reasoning on:

curl -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.2",
    "messages": [{"role": "user", "content": "Reply with exactly: ok"}],
    "max_tokens": 256,
    "thinking": {"type": "enabled"}
  }'

Streaming

Set stream to true and pass -N to curl (which disables output buffering) to receive server-sent events as they arrive:

curl -N -sS "https://pass.wafer.ai/v1/chat/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "messages": [{"role": "user", "content": "Write a one-sentence haiku."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "stream": true
  }'

Usage chunks are always included on streaming requests. Wafer automatically sets stream_options: {"include_usage": true, "continuous_usage_stats": true} on every streaming chat completion so the final SSE chunk carries usage.{prompt_tokens, completion_tokens, total_tokens}. You don’t need to send stream_options yourself — and if you do, the auto-injected values still win. This means you can reliably bill / track token spend from streaming responses the same way you would from non-streaming.

Tool calls and streaming. When the model decides to call a tool, the full tool_calls array arrives in a single chunk rather than streamed argument-by-argument. Buffer the chunk before processing; partial tool-call deltas will not occur on Wafer.
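The sketch below folds a stream into one result: concatenated text, the tool_calls array, and the last usage object seen. It assumes the OpenAI-style wire format, which this page does not spell out: `data: ` lines, a `[DONE]` sentinel, and `choices[].delta` chunks. The function name `collect_stream` is hypothetical, and it expects already-decoded text lines.

```python
import json


def collect_stream(lines) -> dict:
    """Fold SSE lines from a streaming chat completion into one result."""
    text_parts, tool_calls, usage = [], [], None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # the last usage object seen wins
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text_parts.append(delta["content"])
            if delta.get("tool_calls"):
                # The array arrives whole in one chunk, so no merging is needed.
                tool_calls = delta["tool_calls"]
    return {"content": "".join(text_parts), "tool_calls": tool_calls, "usage": usage}
```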

Chat Request Body

Field	Type	Required	Notes
`model`	string	Yes	Any Serverless model ID from `GET /v1/models`, such as `GLM-5.1` or `Qwen3.5-397B-A17B`.
`messages`	array	Yes	OpenAI-compatible chat messages with `role` and `content`.
`max_tokens`	integer	No	Maximum generated tokens. Must be positive when provided.
`temperature`	number	No	Sampling temperature. Use `0` for deterministic decoding.
`top_p`	number	No	Nucleus sampling cutoff.
`top_k`	integer	No	Limits sampling to the top K candidate tokens on supported models.
`min_p`	number	No	Minimum probability threshold on supported models.
`frequency_penalty`	number	No	Penalizes repeated tokens by frequency.
`presence_penalty`	number	No	Penalizes tokens that have already appeared.
`repetition_penalty`	number	No	SGLang repetition penalty on supported models.
`stop`	string or array	No	Stop sequence or sequences.
`stream`	boolean	No	When `true`, returns streaming chat completion chunks.
`tools`	array	No	OpenAI-compatible tool definitions on models that support tool calling.
`tool_choice`	string or object	No	Controls tool selection for compatible models.
`response_format`	object	No	Use JSON mode or structured outputs on compatible models.
`logprobs`	boolean	No	Request token log probabilities on compatible models.
`top_logprobs`	integer	No	Number of log probabilities to include when `logprobs` is enabled.
`thinking`	object or boolean	No	Recommended reasoning on/off control, for example `{"type": "enabled"}` or `{"type": "disabled"}`.
`reasoning_effort`	string	No	Reasoning effort: `none`, `low`, `medium`, `high`, or `max`.
`enable_thinking`	boolean	No	Compatibility reasoning on/off control.
`preserve_thinking`	boolean	No	Wafer-shaped flag that preserves reasoning across turns. Supported on `Kimi-K2.6`, `Kimi-K2.7-Code`, and `GLM-5.1`. See Multi-turn preserved thinking.
`seed`	integer	No	Deterministic sampling seed on backends that support it.
`n`	integer	No	Number of completions to generate. Stripped on Kimi-K2.6 (see Kimi-K2.6 sampling params are stripped).
`logit_bias`	object	No	Token-level logit adjustments. Supported on SGLang backends.
`parallel_tool_calls`	boolean	No	Allow the model to emit multiple `tool_calls` in a single response.
`stream_options`	object	No	`{"include_usage": true}` to include token counts in the final streaming chunk. Wafer sets this automatically on every streaming request — see Streaming.

Unsupported or model-specific parameters return a request error instead of being silently ignored — except where noted in Model-specific behavior.

Model-specific Behavior

A handful of routes intentionally diverge from the generic OpenAI/Anthropic contract. Know these before you ship.

Kimi-K2.6 sampling params are stripped

Kimi-K2.6 forwards to Moonshot’s hosted kimi-k2.6, which enforces fixed sampling values (temperature=1.0, top_p=0.95, n=1, presence_penalty=0, frequency_penalty=0) and rejects anything else. Wafer strips temperature, top_p, n, presence_penalty, and frequency_penalty from Kimi-K2.6 requests before forwarding. If you send temperature: 0 to Kimi-K2.6, expect Moonshot-default sampling (temperature=1.0) at the model. Either pick a model where those controls take effect (GLM-5.1, Qwen3.5-397B-A17B, etc.), or compensate with prompt engineering / reasoning_effort.
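Because stripping is silent, it can help to surface it in your own logs. The sketch below lists the fields in a request body that would be dropped; `stripped_params` and the `KIMI_K26_STRIPPED` tuple are hypothetical names that copy the list from the paragraph above.

```python
# Copied from the documentation above; keep it in sync by hand.
KIMI_K26_STRIPPED = ("temperature", "top_p", "n", "presence_penalty", "frequency_penalty")


def stripped_params(body: dict) -> list[str]:
    """Return the sampling fields in `body` that Wafer drops for Kimi-K2.6."""
    if body.get("model") != "Kimi-K2.6":
        return []
    return [name for name in KIMI_K26_STRIPPED if name in body]
```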

MiniMax-M3 returns inline `<think>` reasoning

See the caveat in Reasoning Controls above. MiniMax-M3 does not populate reasoning_content; it inlines <think>…</think> in content instead.

Multi-turn preserved thinking

Kimi-K2.6, Kimi-K2.7-Code, and GLM-5.1 accept either preserve_thinking: true (Wafer shape) or thinking: {"type": "enabled", "keep": "all"} (Moonshot shape) to carry prior turns’ reasoning back into the next turn’s context. The previous turn’s reasoning_content is inlined as <think>…</think> inside the assistant message before the chat template runs, so the model can build on its own earlier chain of thought.

{
  "model": "Kimi-K2.6",
  "thinking": {"type": "enabled"},
  "preserve_thinking": true,
  "messages": [
    {"role": "user", "content": "Hard problem…"},
    {
      "role": "assistant",
      "content": "…final answer from turn 1…",
      "reasoning_content": "…chain of thought from turn 1…"
    },
    {"role": "user", "content": "Follow-up…"}
  ]
}

Default is off — reasoning is not preserved across turns unless you opt in.
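Building the follow-up request from the previous response might look like the sketch below. `next_turn` is a hypothetical helper; it assumes the previous assistant message is the parsed `choices[0].message` object and uses the Wafer-shaped preserve_thinking flag.

```python
def next_turn(body: dict, assistant_message: dict, user_text: str) -> dict:
    """Build the follow-up request, carrying the prior turn's reasoning forward."""
    reply = {"role": "assistant", "content": assistant_message.get("content") or ""}
    if assistant_message.get("reasoning_content"):
        # Send the reasoning back so the server can inline it for the next turn.
        reply["reasoning_content"] = assistant_message["reasoning_content"]
    return {
        **body,
        "preserve_thinking": True,
        "messages": [*body["messages"], reply, {"role": "user", "content": user_text}],
    }
```

The original request body is left unmodified, so it can be reused or logged as-is.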

JSON Schema references in tools and structured outputs

Wafer Serverless accepts common MCP, Zod, and Pydantic JSON Schemas that use safe local references such as #/$defs/... or #/definitions/.... For compatible models, Wafer inlines those local definitions before dispatching the request upstream. Remote references and recursive schemas are not supported. Inline those schemas client-side or simplify them before retrying.
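For schemas you want to flatten before sending, client-side inlining can be sketched as below. This is an illustration under stated assumptions, not Wafer's implementation: `inline_local_refs` is a hypothetical helper that resolves only `#/$defs/...` and `#/definitions/...` pointers and raises `ValueError` on remote, unresolved, or recursive references.

```python
def inline_local_refs(schema: dict) -> dict:
    """Return a copy of `schema` with local $ref pointers replaced by their targets."""

    def resolve(pointer: str):
        if not pointer.startswith(("#/$defs/", "#/definitions/")):
            raise ValueError(f"unsupported $ref: {pointer}")
        node = schema
        for part in pointer[2:].split("/"):
            part = part.replace("~1", "/").replace("~0", "~")  # JSON Pointer escapes
            if not isinstance(node, dict) or part not in node:
                raise ValueError(f"unresolved $ref: {pointer}")
            node = node[part]
        return node

    def walk(node, active: tuple):
        if isinstance(node, list):
            return [walk(item, active) for item in node]
        if not isinstance(node, dict):
            return node
        if "$ref" in node:
            pointer = node["$ref"]
            if pointer in active:
                raise ValueError(f"recursive $ref: {pointer}")
            target = walk(resolve(pointer), active + (pointer,))
            siblings = {k: walk(v, active) for k, v in node.items() if k != "$ref"}
            # Keywords written next to $ref override the inlined definition.
            return {**target, **siblings} if isinstance(target, dict) else target
        return {k: walk(v, active) for k, v in node.items()}

    # Drop the top-level definition tables; every use has been inlined.
    body = {k: v for k, v in schema.items() if k not in ("$defs", "definitions")}
    return walk(body, ())
```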

Text Completions

Use POST /v1/completions only when you need token-ID prompts or constrained decoding on a supported route:

curl -sS "https://pass.wafer.ai/v1/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 2,
    "temperature": 0,
    "ebnf": "root ::= \"A\" | \"B\""
  }'

For the full /v1/completions request shape, streaming example, parameter table, and response shape, see Tokenized Completions and Constrained Decoding.

Anthropic Messages

Wafer also exposes an Anthropic-compatible Messages endpoint at https://pass.wafer.ai/v1/messages. Most users reach it through Claude Code or Conductor; see Agent Setup for the required environment variables.
