
A high-performance, dense Transformer model from the Qwen3 series featuring 32 billion parameters. This version is optimized with FP8 quantization, allowing it to fit within a ~32GB VRAM footprint while maintaining near-lossless perplexity. It serves as an ideal "workhorse" model for developers needing a balance between high-level reasoning and fast inference speeds for enterprise-grade chat and logic applications.
Qwen3-32B-FP8 is a medium-scale flagship dense model optimized for high-performance inference. With 32 billion parameters, it offers a significant leap in coding, mathematics, and complex instruction following over previous generations. The FP8 precision allows for a smaller memory footprint and 1.8x faster inference speed on modern GPU architectures compared to BF16, making it ideal for scalable enterprise-grade applications.
Billed per 1M tokens (Input + Output).
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Must be `Qwen/Qwen3-32B-FP8`. |
| `messages` | array | Yes | Chat message objects (`role` & `content`). |
| `max_tokens` | integer | No | Maximum output tokens. Supports a 128K context window. |
| `temperature` | float | No | Randomness (0.0–2.0). Default: 0.7. |
| `top_p` | float | No | Nucleus sampling threshold. Default: 0.8. |
| `stream` | boolean | No | Enables real-time token streaming. |
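As a sketch of how these parameters fit together (assuming an OpenAI-compatible chat completions endpoint; the URL below is a placeholder, not a documented address), a minimal request body might be assembled like this:

```python
import json

# Placeholder endpoint -- substitute your provider's actual base URL and auth.
API_URL = "https://api.example.com/v1/chat/completions"

def build_payload(user_message: str, stream: bool = False) -> dict:
    """Assemble a chat completions request body for Qwen3-32B-FP8.

    Only `model` and `messages` are required; the sampling fields below
    mirror the documented defaults (temperature 0.7, top_p 0.8).
    """
    return {
        "model": "Qwen/Qwen3-32B-FP8",
        "messages": [
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.8,
        "stream": stream,
    }

payload = build_payload("Explain FP8 quantization in one paragraph.")
print(json.dumps(payload, indent=2))
```

Setting `stream=True` switches the response to server-sent token chunks, which is what the Conversational AI preset below relies on for low perceived latency.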
Qwen3-32B's dense architecture is robust across various tasks. Optimize with these settings:
| Scenario | Recommended Params | Purpose |
|---|---|---|
| Code Debugging | `temperature: 0.1`, `top_p: 0.95` | Ensures deterministic and syntactically correct code fixes. |
| Long Doc Synthesis | `temperature: 0.3`, `max_tokens: 4096` | High accuracy for extracting insights from 100+ page PDFs. |
| Creative Ideation | `temperature: 0.85`, `presence_penalty: 0.2` | Boosts linguistic variety for marketing or storytelling. |
| Conversational AI | `temperature: 0.7`, `stream: true` | Balanced tone with the lowest perceived latency. |
| Structured JSON | `temperature: 0.0`, `top_p: 1.0` | Forces the model into its most deterministic state for data parsing. |
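The presets above can be captured as reusable request overrides. A minimal sketch (the scenario keys and helper function are illustrative, not part of the API):

```python
# Per-scenario sampling overrides, taken directly from the table above.
SCENARIO_PRESETS = {
    "code_debugging":     {"temperature": 0.1,  "top_p": 0.95},
    "long_doc_synthesis": {"temperature": 0.3,  "max_tokens": 4096},
    "creative_ideation":  {"temperature": 0.85, "presence_penalty": 0.2},
    "conversational_ai":  {"temperature": 0.7,  "stream": True},
    "structured_json":    {"temperature": 0.0,  "top_p": 1.0},
}

def request_params(scenario: str, **extra) -> dict:
    """Merge a scenario preset into a base request; explicit kwargs win."""
    params = {"model": "Qwen/Qwen3-32B-FP8"}
    params.update(SCENARIO_PRESETS[scenario])
    params.update(extra)
    return params

# Example: deterministic settings for JSON extraction, capped at 512 tokens.
params = request_params("structured_json", max_tokens=512)
```

Keeping presets in one place like this makes it easy to tune per-scenario defaults without touching call sites.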