
Qwen3.5 27B

A high-intelligence sparse Mixture-of-Experts (MoE) model optimized for advanced reasoning, complex instruction following, and precise tool use. With 27B parameters and fine-grained FP8 quantization, it offers a 262K-token native context window and built-in 'thinking mode' support, delivering elite-level logic and linguistic performance with exceptional inference efficiency.

Pricing: Input $0.300 / 1M tokens; Output $2.400 / 1M tokens


Qwen3.5-27B-FP8 API Documentation

Qwen3.5-27B-FP8 is a state-of-the-art sparse Mixture-of-Experts (MoE) multimodal foundation model released by Alibaba Cloud. This variant uses fine-grained FP8 (8-bit floating point) quantization, allowing the 27-billion-parameter model to achieve elite-level reasoning while significantly reducing memory footprint and increasing inference throughput relative to the BF16 variant.


Key Features

  • Efficient Hybrid Architecture: Combines Gated Delta Networks with Sparse MoE, activating only a fraction of its 27B parameters per token for high-speed inference.
  • Unified Vision-Language: Natively processes both text and high-resolution images within a single transformer architecture.
  • 262K Native Context: Supports 262,144 tokens natively, extensible up to 1.01 million tokens via RoPE scaling (e.g., YaRN).
  • Thinking Mode: Features an internal reasoning loop (Chain-of-Thought) to solve complex logic, coding, and mathematical problems.
  • Hardware Optimization: FP8 support enables high-performance serving on NVIDIA Hopper (H100), Blackwell, and consumer-grade Ada Lovelace (RTX 40-series) GPUs.
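When thinking mode is active, responses can contain the model's internal reasoning alongside the final answer. A minimal sketch of separating the two, assuming the reasoning is delimited by <think>…</think> tags (a common convention for this model family; the exact delimiter may vary by serving stack):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate a <think>...</think> reasoning block from the final answer.

    Assumes the serving stack emits reasoning inside <think> tags;
    adjust the pattern if your deployment uses a different delimiter.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()  # no reasoning block present
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

reasoning, answer = split_thinking("<think>2+2 is 4.</think>The answer is 4.")
# reasoning -> "2+2 is 4.", answer -> "The answer is 4."
```

Stripping the reasoning block before display keeps user-facing output clean while preserving the chain of thought for logging or debugging.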

Request Parameters

The endpoint /v1/chat/completions accepts the following parameters in a JSON-encoded body.

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Use "Qwen/Qwen3.5-27B-FP8". |
| messages | array | Yes | List of message objects. Supports text and image inputs (via URL or base64). |
| max_tokens | integer | No | Limits generated response length. Suggested max: 8,192. |
| temperature | float | No | 0.0 to 1.5. Use 0.6 for standard tasks; 1.0+ for creative reasoning. |
| stream | boolean | No | If true, tokens are delivered via Server-Sent Events (SSE). |
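The parameters above can be assembled into a request with nothing but the standard library. A minimal sketch (the Bearer-token header and base URL follow common OpenAI-compatible conventions and are assumptions; verify both against your provider):

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, user_text: str) -> urllib.request.Request:
    """Assemble a POST request to /v1/chat/completions for Qwen/Qwen3.5-27B-FP8.

    The "Authorization: Bearer ..." header is a common OpenAI-compatible
    convention, not confirmed by this documentation; adapt as needed.
    """
    payload = {
        "model": "Qwen/Qwen3.5-27B-FP8",
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 1024,   # keep well under the suggested 8,192 cap
        "temperature": 0.6,   # recommended value for standard tasks
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("https://api.example.com", "YOUR_API_KEY", "Hello!")
# urllib.request.urlopen(req) would send it; omitted here to stay offline.
```

For image inputs, the same payload would carry a content array mixing text parts and image parts (URL or base64), per the messages parameter above.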

Advanced Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_thinking | boolean | true | Enables the <think> reasoning block. Recommended for complex tasks. |
| top_p | float | 0.95 | Nucleus sampling: restricts sampling to the smallest token set whose cumulative probability reaches 0.95. |
| presence_penalty | float | 0.0 | Range: -2.0 to 2.0. Penalizes repeated topics. |
| stop | array | null | Up to 4 sequences at which the API stops generating tokens. |
| language-model-only | boolean | false | (Serving-level) Disables the vision encoder to save VRAM for KV cache. |
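When stream is set to true (see Request Parameters above), tokens arrive as Server-Sent Events. A sketch of extracting text from the `data:` lines, assuming OpenAI-style chunk objects with `choices[0].delta.content` and a `data: [DONE]` sentinel (both common conventions, not guaranteed by every serving stack):

```python
import json
from typing import Iterable, Iterator

def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from SSE lines of the form 'data: {...}'.

    Assumes OpenAI-style streaming chunks; field names and the
    '[DONE]' sentinel may differ on other deployments.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comments
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(body)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_text(sample))
# text -> "Hello"
```

In production the lines would come from iterating over the HTTP response body rather than a hard-coded list.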

Model Specifications Summary

| Metric | Specification |
|---|---|
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Quantization | FP8 (fine-grained, block size 128) |
| Context Length | 262,144 tokens (native) |
| Modalities | Text, Vision (image) |
| Language Support | 201+ languages |
