
A high-intelligence sparse Mixture-of-Experts (MoE) model optimized for advanced reasoning, complex instruction following, and precise tool use. With 27B parameters and fine-grained FP8 quantization, it offers a native 262K-token context window and built-in 'thinking mode' support, delivering elite-level logic and linguistic performance with exceptional inference efficiency.
Qwen3.5-27B-FP8 is a state-of-the-art sparse Mixture-of-Experts (MoE) multimodal foundation model released by Alibaba Cloud. This variant uses fine-grained FP8 (8-bit floating point) quantization, allowing the 27-billion-parameter model to achieve elite-level reasoning while significantly reducing its memory footprint and increasing inference throughput compared to BF16 deployments.
The /v1/chat/completions endpoint accepts the following parameters in a JSON-encoded request body (see the example request after the table below).
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Use "Qwen/Qwen3.5-27B-FP8". |
| messages | array | Yes | List of message objects. Supports text and image inputs (via URL or base64). |
| max_tokens | integer | No | Limits generated response length. Suggested max: 8,192. |
| temperature | float | No | 0.0 to 1.5. Use 0.6 for standard tasks; 1.0+ for creative reasoning. |
| stream | boolean | No | If true, tokens are delivered via Server-Sent Events (SSE). |
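The sketch below sends a minimal non-streaming request with the required model and messages fields plus max_tokens and temperature. The base URL, API key environment variable, and the OpenAI-style response shape (choices[0].message.content) are assumptions for illustration and may differ for your provider.

```python
import os
import requests

# Hypothetical values: substitute your provider's real base URL and key.
BASE_URL = "https://api.example.com"
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "Qwen/Qwen3.5-27B-FP8",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain FP8 quantization in two sentences."},
    ],
    "max_tokens": 512,   # keep well under the suggested 8,192 cap
    "temperature": 0.6,  # recommended setting for standard tasks
}

resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
# Assumes an OpenAI-style response body; adjust if your provider differs.
print(resp.json()["choices"][0]["message"]["content"])
```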
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_thinking | boolean | true | Enables the `<think>` reasoning block. Recommended for complex tasks. |
| top_p | float | 0.95 | Nucleus sampling: limits choices to the top 95% cumulative probability. |
| presence_penalty | float | 0.0 | Range: -2.0 to 2.0. Penalizes repeated topics. |
| stop | array | null | Up to 4 sequences where the API will stop generating tokens. |
| language-model-only | boolean | false | (Serving-level) Disables the vision encoder to save VRAM for KV cache. |
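These optional parameters compose naturally with streaming. The sketch below enables the thinking block, supplies a stop sequence, and reads Server-Sent Events; the "data: "-prefixed, OpenAI-style delta chunks and the [DONE] sentinel are assumptions about the stream format and may differ for your provider.

```python
import json
import os
import requests

BASE_URL = "https://api.example.com"  # hypothetical
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "Qwen/Qwen3.5-27B-FP8",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "enable_thinking": True,   # emit the <think> reasoning block
    "temperature": 1.0,        # higher value for creative reasoning
    "stop": ["</answer>"],     # example stop sequence (up to 4 allowed)
    "stream": True,            # deliver tokens via Server-Sent Events
}

with requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    stream=True,
    timeout=300,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # Each SSE event is assumed to be a "data: {...}" line carrying an
        # OpenAI-style delta chunk; adapt if the provider's format differs.
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)
```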
| Metric | Specification |
|---|---|
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Quantization | FP8 (Fine-grained, Block size 128) |
| Context Length | 262,144 Tokens (Native) |
| Modalities | Text, Vision (Image) |
| Language Support | 201+ Languages |
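Because the model accepts image inputs (via URL or base64, per the messages parameter above), a vision request can be sketched as below. The content-parts structure with type "image_url" follows the common OpenAI-style convention and is an assumption here, not a schema documented on this page.

```python
import os
import requests

BASE_URL = "https://api.example.com"  # hypothetical
API_KEY = os.environ["API_KEY"]

# A single user message mixing text and an image URL (assumed content-parts layout).
payload = {
    "model": "Qwen/Qwen3.5-27B-FP8",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart in one paragraph."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    "max_tokens": 1024,
}

resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```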
Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.