
Qwen3-32B-FP8

A high-performance, dense Transformer model from the Qwen3 series featuring 32 billion parameters. This version is optimized with FP8 quantization, allowing it to fit within a ~32GB VRAM footprint while maintaining near-lossless perplexity. It serves as an ideal "workhorse" model for developers needing a balance between high-level reasoning and fast inference speeds for enterprise-grade chat and logic applications.

$0.24/M input tokens; $1.80/M output tokens

Qwen3-32B-FP8

Qwen3-32B-FP8 is a medium-scale flagship dense model optimized for high-performance inference. With 32 billion parameters, it offers a significant leap in coding, mathematics, and complex instruction following over previous generations. The FP8 precision allows for a smaller memory footprint and 1.8x faster inference speed on modern GPU architectures compared to BF16, making it ideal for scalable enterprise-grade applications.

Key Capabilities

  • Logic & Math: State-of-the-art performance in competitive programming and mathematical reasoning.
  • Instruction Following: Reliably follows complex, multi-step system prompts.
  • Vast Context: Natively supports a context window of up to 128,000 tokens.
  • Quantization Efficiency: Minimized accuracy drop with optimized FP8 activation-aware scaling.

Billing

Billed per 1M tokens (Input + Output).
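At the listed rates, per-request cost is simple arithmetic. A minimal sketch (the helper name is our own, not part of any billing API):

```python
# Token pricing for Qwen/Qwen3-32B-FP8 (USD per 1M tokens), from the listing above.
INPUT_PRICE_PER_M = 0.24
OUTPUT_PRICE_PER_M = 1.80

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, billed per 1M tokens (input + output)."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 2,000-token prompt with a 500-token reply
# costs 0.00048 + 0.00090 = 0.00138 USD.
cost = request_cost(2000, 500)
```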

Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Must be Qwen/Qwen3-32B-FP8. |
| messages | array | Yes | Chat message objects (role & content). |
| max_tokens | integer | No | Maximum number of output tokens; total context window is 128K. |
| temperature | float | No | Sampling randomness (0.0–2.0). Default: 0.7. |
| top_p | float | No | Nucleus sampling threshold. Default: 0.8. |
| stream | boolean | No | Enables real-time token streaming. |
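Putting the required and optional parameters together, a request body looks like the sketch below. It assumes an OpenAI-compatible /chat/completions endpoint; the base URL, environment-variable names, and helper functions are illustrative, not part of this platform's documented API.

```python
import json
import os
import urllib.request

# Assumed OpenAI-compatible endpoint; substitute your provider's base URL.
API_URL = os.environ.get("API_URL", "https://api.example.com/v1/chat/completions")
API_KEY = os.environ.get("API_KEY", "")

def build_payload(user_message: str, **params) -> dict:
    """Assemble a chat-completions request body for Qwen3-32B-FP8."""
    payload = {
        "model": "Qwen/Qwen3-32B-FP8",                      # required
        "messages": [{"role": "user", "content": user_message}],  # required
    }
    payload.update(params)  # optional: max_tokens, temperature, top_p, stream
    return payload

def send(payload: dict) -> dict:
    """POST the payload with bearer auth and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_payload("Explain FP8 quantization in two sentences.",
                        max_tokens=256, temperature=0.7, top_p=0.8)
# send(payload)  # requires a valid API key and endpoint
```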

Optional Parameters (Qwen3 Optimization)

Qwen3-32B's dense architecture performs well across a wide range of tasks; tune these settings per scenario:

| Scenario | Recommended Params | Purpose |
|---|---|---|
| Code Debugging | temperature: 0.1, top_p: 0.95 | Deterministic, syntactically correct code fixes. |
| Long Doc Synthesis | temperature: 0.3, max_tokens: 4096 | High accuracy when extracting insights from 100+ page PDFs. |
| Creative Ideation | temperature: 0.85, presence_penalty: 0.2 | Boosts linguistic variety for marketing or storytelling. |
| Conversational AI | temperature: 0.7, stream: true | Balanced tone with the lowest perceived latency. |
| Structured JSON | temperature: 0.0, top_p: 1.0 | Near-greedy decoding for reliable, parseable output. |
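These presets are easy to keep as a small lookup table and merge into a base request. A minimal sketch; the preset names and helper are our own invention, mirroring the recommendations above:

```python
# Illustrative presets mirroring the scenario table (names are our own).
PRESETS = {
    "code_debugging":     {"temperature": 0.1,  "top_p": 0.95},
    "long_doc_synthesis": {"temperature": 0.3,  "max_tokens": 4096},
    "creative_ideation":  {"temperature": 0.85, "presence_penalty": 0.2},
    "conversational":     {"temperature": 0.7,  "stream": True},
    "structured_json":    {"temperature": 0.0,  "top_p": 1.0},
}

def apply_preset(payload: dict, scenario: str) -> dict:
    """Return a copy of the base payload with the scenario's params merged in."""
    merged = dict(payload)
    merged.update(PRESETS[scenario])
    return merged

base = {"model": "Qwen/Qwen3-32B-FP8",
        "messages": [{"role": "user", "content": "Fix this stack trace."}]}
request_body = apply_preset(base, "code_debugging")
```

Merging into a copy keeps the base payload reusable across scenarios.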

Parameters Summary

  • model: Must be "Qwen/Qwen3-32B-FP8".
  • messages: Array of message objects (role & content).
  • temperature: Default 0.7.
  • top_p: Default 0.8.
