Qwen/Qwen3-32B-FP8

Qwen3 32B

A high-performance dense Transformer model from the Qwen3 series with 32.8 billion parameters. Fine-grained FP8 quantization reduces the VRAM required for the weights to ~32.8 GB while supporting a native 128K context window. With an integrated 'Thinking Mode' for deep reasoning, it strikes a practical balance between complex logic and fast inference for high-concurrency enterprise applications.

INPUT $0.240/1M TOKENS; OUTPUT $1.800/1M TOKENS

Qwen3-32B-FP8 API Documentation 🦾

Qwen3-32B-FP8 is a medium-scale flagship dense Transformer model optimized for top-tier performance and high-concurrency inference. With 32.8 billion parameters, it delivers a generational leap in coding, mathematics, and complex instruction following.

This version utilizes fine-grained FP8 quantization, offering a near-lossless experience with significantly reduced VRAM requirements and up to 1.8x faster inference on modern GPU architectures (NVIDIA Hopper/Ada Lovelace) compared to BF16.


Key Capabilities

  • Deep Reasoning: Integrated Native Thinking Mode for solving high-level mathematical and logical problems (see the request sketch after this list).
  • Enterprise-Grade Context: Natively supports a 128,000 token (128K) context window, perfect for long-form document analysis.
  • Instruction Precision: Highly responsive to complex system prompts and multi-step constraints.
  • Quantization Mastery: Optimized with activation-aware scaling to minimize perplexity loss in 8-bit precision.
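
The sketch below shows Thinking Mode end to end: it sends a request with the enable_thinking flag documented in the request parameters further down, then splits the reply into its <think> reasoning trace and the final answer. The base URL, API-key handling, and OpenAI-style response shape are illustrative assumptions rather than guarantees about this endpoint.

```python
import os
import requests

# Hypothetical base URL and API key handling; substitute your provider's values.
BASE_URL = os.environ.get("API_BASE_URL", "https://api.example.com")
API_KEY = os.environ["API_KEY"]

def ask_with_thinking(question: str) -> tuple[str, str]:
    """Send a chat request with enable_thinking and split the reply into
    its <think> reasoning trace and the final answer."""
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "Qwen/Qwen3-32B-FP8",
            "messages": [{"role": "user", "content": question}],
            "enable_thinking": True,   # default per the request parameters below
            "temperature": 0.2,        # low temperature suits reasoning workloads
        },
        timeout=120,
    )
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]

    # The reasoning trace arrives wrapped in a <think>...</think> block.
    thinking, answer = "", content
    if "<think>" in content and "</think>" in content:
        start = content.index("<think>") + len("<think>")
        end = content.index("</think>")
        thinking = content[start:end].strip()
        answer = content[end + len("</think>"):].strip()
    return thinking, answer
```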

Technical Specifications

  • Architecture: Dense Transformer (Non-MoE).
  • Quantization: FP8 (Fine-grained).
  • Context Window: 128,000 Tokens.
  • Minimum VRAM (Weights Only): ~32.8 GB.
  • Recommended VRAM (for 128K Context): 48 GB+ (e.g., A6000, A100 80GB, or RTX 6000 Ada).
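
As a back-of-the-envelope check on the figures above, the sketch below computes the FP8 weight footprint (one byte per parameter) and a rough KV-cache size at the full 128K context. The attention geometry used for the cache estimate (layers, KV heads, head dimension) is an illustrative assumption for a 32B-class GQA model, not an official specification.

```python
# Rough VRAM arithmetic behind the figures above. FP8 stores one byte per weight,
# so 32.8B parameters come to ~32.8 GB before KV cache and activation overhead.

def weights_gb(num_params: float, bytes_per_param: float = 1.0) -> float:
    """Weight memory in decimal GB (FP8 -> 1 byte per parameter)."""
    return num_params * bytes_per_param / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: float = 1.0) -> float:
    """Per-token KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes."""
    return tokens * 2 * layers * kv_heads * head_dim * bytes_per_value / 1e9

if __name__ == "__main__":
    print(f"FP8 weights:      ~{weights_gb(32.8e9):.1f} GB")
    # Assumed geometry (illustrative, not official): 64 layers, 8 KV heads,
    # head_dim 128, FP8 KV cache.
    print(f"KV cache at 128K: ~{kv_cache_gb(128_000, 64, 8, 128):.1f} GB")
```

Under these assumptions the weights plus a full 128K KV cache land near ~50 GB, which is why 48 GB-class GPUs are recommended for long-context workloads rather than the ~32.8 GB needed for the weights alone.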

Request Parameters (/v1/chat/completions)

  • model (string, required): Use "Qwen/Qwen3-32B-FP8".
  • messages (array, required): Chat message objects with role and content fields.
  • enable_thinking (boolean, optional): Enables the <think> reasoning block. Default: true.
  • max_tokens (integer, optional): Maximum number of output tokens within the 128K context window.
  • temperature (float, optional): Sampling randomness (0.0-2.0). Default: 0.7.
  • top_p (float, optional): Nucleus sampling threshold. Default: 0.95.
  • stream (boolean, optional): Enables real-time token streaming via SSE (see the streaming sketch below).
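
When stream is enabled, the response arrives as server-sent events. A minimal consumer is sketched below, assuming the common OpenAI-compatible chunk layout (data:-prefixed JSON deltas terminated by [DONE]); the base URL and key handling are placeholders.

```python
import json
import os
import requests

# Hypothetical endpoint and key handling; the SSE chunk format mirrors the common
# OpenAI-compatible layout and is an assumption, not a guarantee of this API.
BASE_URL = os.environ.get("API_BASE_URL", "https://api.example.com")
API_KEY = os.environ["API_KEY"]

def stream_completion(messages: list[dict]) -> str:
    """POST a streaming chat request and print delta tokens as they arrive."""
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "Qwen/Qwen3-32B-FP8",
            "messages": messages,
            "stream": True,       # enables server-sent events as documented above
            "max_tokens": 1024,
        },
        stream=True,
        timeout=300,
    )
    resp.raise_for_status()
    pieces = []
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue                      # skip keep-alives and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":           # stream terminator
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content", "")
        pieces.append(delta)
        print(delta, end="", flush=True)
    return "".join(pieces)
```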

Optimization Scenarios

  • Complex Reasoning: enable_thinking: true, temperature: 0.2. Best for math, coding, and logical troubleshooting.
  • Long Document Synthesis: temperature: 0.3, max_tokens: 4096. High accuracy when extracting insights from large documents.
  • Creative Ideation: temperature: 0.85, presence_penalty: 0.2. Boosts linguistic variety for marketing or storytelling.
  • Structured Data: temperature: 0.0, top_p: 1.0. Forces deterministic output for JSON/XML generation.
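
One convenient way to apply these presets is a small lookup table merged into the request body; the helper below is purely illustrative, and the preset names are not part of the API.

```python
# Hypothetical presets mirroring the scenarios listed above.
SCENARIO_PRESETS = {
    "complex_reasoning":  {"enable_thinking": True, "temperature": 0.2},
    "long_doc_synthesis": {"temperature": 0.3, "max_tokens": 4096},
    "creative_ideation":  {"temperature": 0.85, "presence_penalty": 0.2},
    "structured_data":    {"temperature": 0.0, "top_p": 1.0},
}

def build_request(scenario: str, messages: list[dict]) -> dict:
    """Return a /v1/chat/completions body with the scenario's recommended parameters applied."""
    body = {"model": "Qwen/Qwen3-32B-FP8", "messages": messages}
    body.update(SCENARIO_PRESETS[scenario])
    return body

# Example: a deterministic request for structured output.
request_body = build_request(
    "structured_data",
    [{"role": "user", "content": "Return the result as JSON."}],
)
```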
