Qwen/Qwen3-32B-FP8

Qwen3 32B

A high-performance dense Transformer model from the Qwen3 series with 32.8 billion parameters. Fine-grained FP8 quantization reduces the VRAM required for the weights to ~32.8 GB while supporting a native 128K context window. With an integrated 'Thinking Mode' for deep reasoning, it strikes a practical balance between complex logic and fast inference for high-concurrency enterprise applications.

INPUT $0.240/1M TOKENS; OUTPUT $1.800/1M TOKENS

Qwen3-32B-FP8 API Documentation 🦾

Qwen3-32B-FP8 is a medium-scale flagship dense Transformer model optimized for top-tier performance and high-concurrency inference. With 32.8 billion parameters, it delivers a generational leap in coding, mathematics, and complex instruction following.

This version utilizes fine-grained FP8 quantization, offering a near-lossless experience with significantly reduced VRAM requirements and up to 1.8x faster inference on modern GPU architectures (NVIDIA Hopper/Ada Lovelace) compared to BF16.


Key Capabilities

  • Deep Reasoning: Integrated Native Thinking Mode for solving high-level mathematical and logical problems (see the request sketch after this list).
  • Enterprise-Grade Context: Natively supports a 128,000 token (128K) context window, perfect for long-form document analysis.
  • Instruction Precision: Highly responsive to complex system prompts and multi-step constraints.
  • Quantization Mastery: Optimized with activation-aware scaling to minimize perplexity loss in 8-bit precision.
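
The sketch below shows Thinking Mode end to end: it sends a request with the enable_thinking flag documented in the request parameters further down, then splits the reply into its <think> reasoning trace and the final answer. The base URL, API-key handling, and OpenAI-style response shape are illustrative assumptions rather than guarantees about this endpoint.

```python
import os
import requests

# Hypothetical base URL and API key handling; substitute your provider's values.
BASE_URL = os.environ.get("API_BASE_URL", "https://api.example.com")
API_KEY = os.environ["API_KEY"]

def ask_with_thinking(question: str) -> tuple[str, str]:
    """Send a chat request with enable_thinking and split the reply into
    its <think> reasoning trace and the final answer."""
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "Qwen/Qwen3-32B-FP8",
            "messages": [{"role": "user", "content": question}],
            "enable_thinking": True,   # default per the request parameters below
            "temperature": 0.2,        # low temperature suits reasoning workloads
        },
        timeout=120,
    )
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]

    # The reasoning trace arrives wrapped in a <think>...</think> block.
    thinking, answer = "", content
    if "<think>" in content and "</think>" in content:
        start = content.index("<think>") + len("<think>")
        end = content.index("</think>")
        thinking = content[start:end].strip()
        answer = content[end + len("</think>"):].strip()
    return thinking, answer
```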

Technical Specifications

  • Architecture: Dense Transformer (Non-MoE).
  • Quantization: FP8 (Fine-grained).
  • Context Window: 128,000 Tokens.
  • Minimum VRAM (Weights Only): ~32.8 GB.
  • Recommended VRAM (for 128K Context): 48 GB+ (e.g., A6000, A100 80GB, or RTX 6000 Ada).
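
As a back-of-the-envelope check on the figures above, the sketch below computes the FP8 weight footprint (one byte per parameter) and a rough KV-cache size at the full 128K context. The attention geometry used for the cache estimate (layers, KV heads, head dimension) is an illustrative assumption for a 32B-class GQA model, not an official specification.

```python
# Rough VRAM arithmetic behind the figures above. FP8 stores one byte per weight,
# so 32.8B parameters come to ~32.8 GB before KV cache and activation overhead.

def weights_gb(num_params: float, bytes_per_param: float = 1.0) -> float:
    """Weight memory in decimal GB (FP8 -> 1 byte per parameter)."""
    return num_params * bytes_per_param / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: float = 1.0) -> float:
    """Per-token KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes."""
    return tokens * 2 * layers * kv_heads * head_dim * bytes_per_value / 1e9

if __name__ == "__main__":
    print(f"FP8 weights:      ~{weights_gb(32.8e9):.1f} GB")
    # Assumed geometry (illustrative, not official): 64 layers, 8 KV heads,
    # head_dim 128, FP8 KV cache.
    print(f"KV cache at 128K: ~{kv_cache_gb(128_000, 64, 8, 128):.1f} GB")
```

Under these assumptions the weights plus a full 128K KV cache land near ~50 GB, which is why 48 GB-class GPUs are recommended for long-context workloads rather than the ~32.8 GB needed for the weights alone.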

Request Parameters (/v1/chat/completions)

  • model (string, required): Use "Qwen/Qwen3-32B-FP8".
  • messages (array, required): Chat message objects with role and content fields.
  • enable_thinking (boolean, optional): Enables the <think> reasoning block. Default: true.
  • max_tokens (integer, optional): Maximum number of output tokens within the 128K context window.
  • temperature (float, optional): Sampling randomness (0.0-2.0). Default: 0.7.
  • top_p (float, optional): Nucleus sampling threshold. Default: 0.95.
  • stream (boolean, optional): Enables real-time token streaming via SSE (see the streaming sketch below).
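
When stream is enabled, the response arrives as server-sent events. A minimal consumer is sketched below, assuming the common OpenAI-compatible chunk layout (data:-prefixed JSON deltas terminated by [DONE]); the base URL and key handling are placeholders.

```python
import json
import os
import requests

# Hypothetical endpoint and key handling; the SSE chunk format mirrors the common
# OpenAI-compatible layout and is an assumption, not a guarantee of this API.
BASE_URL = os.environ.get("API_BASE_URL", "https://api.example.com")
API_KEY = os.environ["API_KEY"]

def stream_completion(messages: list[dict]) -> str:
    """POST a streaming chat request and print delta tokens as they arrive."""
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "Qwen/Qwen3-32B-FP8",
            "messages": messages,
            "stream": True,       # enables server-sent events as documented above
            "max_tokens": 1024,
        },
        stream=True,
        timeout=300,
    )
    resp.raise_for_status()
    pieces = []
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue                      # skip keep-alives and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":           # stream terminator
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content", "")
        pieces.append(delta)
        print(delta, end="", flush=True)
    return "".join(pieces)
```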

Optimization Scenarios

  • Complex Reasoning: enable_thinking: true, temperature: 0.2. Best for math, coding, and logical troubleshooting.
  • Long Document Synthesis: temperature: 0.3, max_tokens: 4096. High accuracy when extracting insights from large documents.
  • Creative Ideation: temperature: 0.85, presence_penalty: 0.2. Boosts linguistic variety for marketing or storytelling.
  • Structured Data: temperature: 0.0, top_p: 1.0. Forces deterministic output for JSON/XML generation.
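
One convenient way to apply these presets is a small lookup table merged into the request body; the helper below is purely illustrative, and the preset names are not part of the API.

```python
# Hypothetical presets mirroring the scenarios listed above.
SCENARIO_PRESETS = {
    "complex_reasoning":  {"enable_thinking": True, "temperature": 0.2},
    "long_doc_synthesis": {"temperature": 0.3, "max_tokens": 4096},
    "creative_ideation":  {"temperature": 0.85, "presence_penalty": 0.2},
    "structured_data":    {"temperature": 0.0, "top_p": 1.0},
}

def build_request(scenario: str, messages: list[dict]) -> dict:
    """Return a /v1/chat/completions body with the scenario's recommended parameters applied."""
    body = {"model": "Qwen/Qwen3-32B-FP8", "messages": messages}
    body.update(SCENARIO_PRESETS[scenario])
    return body

# Example: a deterministic request for structured output.
request_body = build_request(
    "structured_data",
    [{"role": "user", "content": "Return the result as JSON."}],
)
```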
