Tokenized Completions and Constrained Decoding

Dedicated endpoints expose OpenAI-compatible inference at https://<ENDPOINT_HOST>/v1. On supported routes such as GLM-5.1, you can send pre-tokenized prompts to /v1/completions and constrain decoding with an SGLang/XGrammar-compatible EBNF grammar by passing the ebnf field in the request body.

Curl Request

curl -sS "https://<ENDPOINT_HOST>/v1/completions" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 2,
    "temperature": 0,
    "ebnf": "root ::= \"A\" | \"B\""
  }'
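
For callers who prefer a script over the shell, the same request can be assembled with the Python standard library. This is a minimal sketch, not an official client: the helper name build_ebnf_request and the placeholder host are hypothetical, and the request is only built here, not sent. Using json.dumps avoids the manual quote escaping that the grammar string needs inside the curl payload.

```python
import json
import urllib.request


def build_ebnf_request(host, api_key, token_ids, grammar, model="GLM-5.1"):
    """Build (but do not send) a POST to /v1/completions with an EBNF constraint."""
    body = {
        "model": model,
        "prompt": list(token_ids),  # pre-tokenized prompt, as in the curl example
        "max_tokens": 2,
        "temperature": 0,
        "ebnf": grammar,            # json.dumps handles the inner double quotes
    }
    return urllib.request.Request(
        f"https://{host}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical placeholder host and key; substitute your own before sending.
req = build_ebnf_request(
    "endpoint.example.invalid", "<API_KEY>", [9703], 'root ::= "A" | "B"'
)
# To send it: urllib.request.urlopen(req), then json.load() the response.
```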

To stream text completion chunks as server-sent events, set stream to true in the request body and pass -N to curl so that it does not buffer the output:

curl -N -sS "https://<ENDPOINT_HOST>/v1/completions" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 16,
    "temperature": 0.2,
    "stream": true
  }'
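
A client consuming this stream has to pick the text out of each event. The sketch below shows one way to do that, assuming the chunks follow the OpenAI text completion convention: each event is a `data:` line carrying JSON with choices[].text, and the stream ends with a `data: [DONE]` sentinel. Both details are assumptions rather than documented behavior of this endpoint, and the sample lines are made up for illustration.

```python
import json


def iter_stream_text(lines):
    """Yield the text fragment from each streamed completion chunk."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text


# Illustrative events only; real chunks carry more fields (id, model, ...).
sample = [
    'data: {"choices": [{"index": 0, "text": "Hel"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": "lo"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```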

Request Body

Field	Type	Required	Notes
`model`	string	Yes	Use a model ID configured on your dedicated endpoint.
`prompt`	array	Yes	Non-empty array of non-negative token IDs for one request, or an array of token-ID arrays for batched requests. Text prompts should use `/v1/chat/completions`.
`max_tokens`	integer	No	Maximum generated tokens. Must be positive when provided.
`min_tokens`	integer	No	Minimum generated tokens before stop conditions can end generation.
`temperature`	number	No	Sampling temperature. Use `0` for deterministic decoding.
`top_p`	number	No	Nucleus sampling cutoff.
`top_k`	integer	No	Limits sampling to the top K candidate tokens.
`min_p`	number	No	Minimum probability threshold for candidate tokens.
`frequency_penalty`	number	No	Penalizes tokens based on frequency in the generated text.
`presence_penalty`	number	No	Penalizes tokens that have already appeared.
`repetition_penalty`	number	No	SGLang repetition penalty.
`stop`	string or array	No	Stop sequence or sequences.
`stop_token_ids`	array	No	Stop generation when one of these token IDs is emitted.
`stream`	boolean	No	When `true`, returns streaming completion chunks.
`ebnf`	string	No	SGLang/XGrammar-compatible grammar.
`regex`	string	No	Regex constraint for constrained decoding.
`json_schema`	object	No	JSON schema constraint for structured output.
`logit_bias`	object	No	Token logit adjustments.
`n`	integer	No	Number of completions to generate.
`skip_special_tokens`	boolean	No	Controls whether special tokens are removed from output text.

Advanced SGLang passthrough fields are also accepted when you need lower-level control: custom_params, ignore_eos, no_stop_trim, spaces_between_special_tokens, stop_regex, structural_tag, custom_logit_processor, logprob_start_len, lora_path, priority, return_hidden_states, return_logprob, return_routed_experts, return_text_in_logprobs, rid, token_ids_logprob, and top_logprobs_num.

prompt must be token IDs on this endpoint. A non-empty array like [9703] is valid; ["hello"], an empty array, booleans, negative integers, and mixed token/string arrays are rejected.
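
These rules can be checked on the client before a request is sent. The following is a sketch of such a check, not the server's actual validation logic. The function names are hypothetical, and it assumes that each inner array of a batched prompt must itself be non-empty, which the rules above imply but do not state outright.

```python
def is_token_id(value):
    """True for a non-negative integer that is not a boolean."""
    # bool is a subclass of int in Python, so it must be excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def is_token_sequence(value):
    """True for a non-empty list made up only of token IDs."""
    return isinstance(value, list) and len(value) > 0 and all(is_token_id(v) for v in value)


def validate_prompt(prompt):
    """Return 'single' or 'batch' for an acceptable prompt, else raise ValueError."""
    if is_token_sequence(prompt):
        return "single"
    if isinstance(prompt, list) and len(prompt) > 0 and all(
        is_token_sequence(item) for item in prompt
    ):
        return "batch"
    raise ValueError("prompt must be a non-empty array of non-negative token IDs")
```

For example, validate_prompt([9703]) returns "single" and validate_prompt([[1, 2], [3]]) returns "batch", while ["hello"], [], [True], [-1], and [1, "a"] all raise ValueError.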

Response Shape

Non-streaming responses use the OpenAI text completion shape:

{
  "id": "<request_id>",
  "object": "text_completion",
  "created": 1770000000,
  "model": "GLM-5.1",
  "choices": [
    {
      "index": 0,
      "text": "A",
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": null
    }
  ],
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 1,
    "total_tokens": 2,
    "prompt_tokens_details": {"cached_tokens": 0},
    "reasoning_tokens": 0
  }
}
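
Reading this shape from client code mostly means indexing into choices and usage. The sketch below parses the sample response shown above and pulls out the fields a caller typically needs. The helper name summarize_completion is hypothetical, and the fallbacks for missing keys are a defensive assumption rather than documented behavior.

```python
import json


def summarize_completion(response):
    """Extract the first choice's text and the token accounting from a response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    details = usage.get("prompt_tokens_details") or {}
    return {
        "text": choice["text"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
        "cached_tokens": details.get("cached_tokens", 0),
    }


# The sample response from this section, trimmed to the fields used here.
raw = """
{
  "id": "<request_id>",
  "object": "text_completion",
  "choices": [{"index": 0, "text": "A", "logprobs": null, "finish_reason": "stop"}],
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 1,
    "total_tokens": 2,
    "prompt_tokens_details": {"cached_tokens": 0}
  }
}
"""
summary = summarize_completion(json.loads(raw))
```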

Use the model IDs and capabilities configured for your dedicated endpoint. If a model route on your endpoint does not support /v1/completions, use the standard chat completions path instead.
