
A massive-scale Vision-Language (VL) MoE model designed for complex multimodal instruction following. With 235B total parameters and 22B activated per token, it delivers top-tier performance in image understanding, document parsing, and visual reasoning. Its weights are quantized to 4 bits via AWQ (Activation-aware Weight Quantization), enabling large-scale multimodal deployment with high throughput.
Qwen3-VL-235B-A22B-Instruct-AWQ is the state-of-the-art (SOTA) multimodal MoE model in the Qwen3 series. With AWQ 4-bit quantization, it brings high-performance visual reasoning and text generation to a smaller hardware footprint. It supports ultra-long contexts of up to 256K tokens, allowing it to analyze massive documents, hours-long videos with second-level timestamp precision, and complex multi-image dialogues.
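As a concrete illustration of the long-video workflow, here is a minimal sketch using the OpenAI-compatible Python client. The base URL, API key, and video URL are placeholders (the source does not specify an endpoint), and the `video_url` content type follows the parameter table below.

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint; substitute your provider's base URL and key.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="Qwen3-VL-235B-A22B-Instruct-AWQ",
    messages=[
        {
            "role": "user",
            "content": [
                # video_url is accepted per the parameter table; the URL is a placeholder.
                {"type": "video_url", "video_url": {"url": "https://example.com/lecture.mp4"}},
                {"type": "text", "text": "Summarize this lecture and give timestamps for each topic."},
            ],
        }
    ],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```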
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Must be `Qwen3-VL-235B-A22B-Instruct-AWQ`. |
| messages | array | Yes | Supports `text`, `image_url`, and `video_url` content types. |
| max_tokens | integer | No | Maximum response length; up to 262,144 tokens are supported. |
| temperature | float | No | Recommended: 0.6 (for reasoning/thinking mode). |
| top_p | float | No | Recommended: 0.95. |
| repetition_penalty | float | No | Recommended: 1.05–1.1 to prevent repetition loops in MoE decoding. |
Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.