
A high-performance, dense Transformer model from the Qwen3 series with 32.8 billion parameters. With fine-grained FP8 quantization, it cuts the VRAM footprint of the weights to roughly 32.8 GB while retaining a native 128K context window. An integrated 'Thinking Mode' enables deep reasoning, striking a balance between complex logic and fast inference for high-concurrency enterprise applications.
Qwen3-32B-FP8 is a medium-scale flagship Dense Transformer model optimized for elite-level performance and high-concurrency inference. Featuring 32.8 billion parameters, it delivers a generational leap in coding, mathematics, and complex instruction following.
This version utilizes fine-grained FP8 quantization, offering a near-lossless experience with significantly reduced VRAM requirements and up to 1.8x faster inference on modern GPU architectures (NVIDIA Hopper/Ada Lovelace) compared to BF16.
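The ~32.8 GB weight footprint follows directly from the parameter count, since FP8 stores one byte per parameter. A back-of-envelope check (illustrative only; a real deployment also needs VRAM for the KV cache, activations, and runtime overhead):

```python
# Back-of-envelope VRAM estimate for weights only.
# Illustrative: excludes KV cache, activations, and framework overhead.
params = 32.8e9          # 32.8 billion parameters
weight_fp8_gb = params * 1 / 1e9   # FP8 = 8 bits = 1 byte per parameter
weight_bf16_gb = params * 2 / 1e9  # BF16 = 2 bytes per parameter

print(f"FP8 weights:  ~{weight_fp8_gb:.1f} GB")   # ~32.8 GB
print(f"BF16 weights: ~{weight_bf16_gb:.1f} GB")  # ~65.6 GB
```

This is where the halved memory requirement relative to BF16 comes from; the speedup on Hopper/Ada Lovelace additionally relies on those GPUs' native FP8 tensor-core support.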
Endpoint: `/v1/chat/completions`

| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Use `"Qwen/Qwen3-32B-FP8"`. |
| `messages` | array | Yes | Chat message objects (`role` & `content`). |
| `enable_thinking` | boolean | No | Default: `true`. Enables the `<think>` reasoning block. |
| `max_tokens` | integer | No | Max output tokens. Supports the 128K context window. |
| `temperature` | float | No | Randomness (0.0 to 2.0). Default: 0.7. |
| `top_p` | float | No | Nucleus sampling threshold. Default: 0.95. |
| `stream` | boolean | No | Enables real-time token streaming via SSE. |
| Scenario | Recommended Params | Purpose |
|---|---|---|
| Complex Reasoning | `enable_thinking: true`, `temp: 0.2` | Best for math, coding, and logical troubleshooting. |
| Long Doc Synthesis | `temperature: 0.3`, `max_tokens: 4096` | High accuracy for extracting insights from massive documents. |
| Creative Ideation | `temperature: 0.85`, `presence_penalty: 0.2` | Boosts linguistic variety for marketing or storytelling. |
| Structured Data | `temperature: 0.0`, `top_p: 1.0` | Forces deterministic output for JSON/XML parsing. |
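The scenario table above can be captured as reusable parameter presets. A sketch (the preset names and helper are illustrative, not part of the API):

```python
# Hypothetical presets mirroring the recommended scenario parameters.
PRESETS = {
    "complex_reasoning":  {"enable_thinking": True, "temperature": 0.2},
    "long_doc_synthesis": {"temperature": 0.3, "max_tokens": 4096},
    "creative_ideation":  {"temperature": 0.85, "presence_penalty": 0.2},
    "structured_data":    {"temperature": 0.0, "top_p": 1.0},
}

def apply_preset(base_request: dict, scenario: str) -> dict:
    """Merge a scenario preset into a request body (preset wins on conflict)."""
    return {**base_request, **PRESETS[scenario]}

req = apply_preset(
    {"model": "Qwen/Qwen3-32B-FP8", "messages": []},
    "structured_data",
)
print(req["temperature"])  # 0.0
```

Centralizing the presets this way keeps per-scenario tuning in one place instead of scattering sampling parameters across call sites.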