Qwen3.5-122B-A10B-FP8

Qwen3.5 122B

The flagship Mixture-of-Experts (MoE) model from the Qwen3.5 series, featuring 122B total and 10B active parameters. This unified vision-language foundation excels in multimodal reasoning, complex coding, and native 'thinking mode' tasks. Utilizing fine-grained FP8 quantization, it offers exceptional throughput and reduced VRAM footprint on H100/L40S GPUs, while supporting a massive 262K context window for long-horizon agentic applications.

INPUT $0.400 / 1M TOKENS · OUTPUT $3.200 / 1M TOKENS


Qwen3.5-122B-A10B-FP8 API Documentation

Qwen3.5-122B-A10B-FP8 is the flagship Sparse Mixture-of-Experts (MoE) multimodal foundation model from the Qwen3.5 series. By integrating 122 billion total parameters with 10 billion active parameters per token, it delivers elite-level reasoning capacity with the inference efficiency of a much smaller model.

This version is optimized with fine-grained FP8 quantization, specifically engineered for NVIDIA Blackwell and Hopper (H100/L40S) architectures to maximize throughput in complex multimodal reasoning and long-context applications.


Key Capabilities

  • Unified Vision-Language: Natively processes both text and high-resolution images within a single transformer foundation.
  • Native Thinking Mode: Features an internal reasoning loop (Chain-of-Thought) for superior performance in math, coding, and complex logic.
  • Massive 262K Context: Natively supports up to 262,144 tokens, extensible to 1.01 million tokens for massive document or codebase analysis.
  • Global Linguistic Coverage: Advanced optimization for 201+ languages and dialects, including industry-leading CJK and English support.
  • FP8 Efficiency: Significant VRAM reduction and hardware-accelerated throughput on modern data center GPUs.
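As a sketch of how an image can be paired with a text prompt in a vision-language request, the snippet below builds a `messages` array using the widely adopted OpenAI-compatible content-part convention (`image_url` with a base64 data URL). The exact multimodal schema here is an assumption based on that convention, not a confirmed detail of this API; verify it against the provider's request format.

```python
import base64

def build_multimodal_messages(image_path: str, question: str) -> list:
    """Build an OpenAI-style messages array pairing one image with a text question.

    Assumption: the API accepts "image_url" content parts with a base64 data
    URL, as in the common OpenAI-compatible convention.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
                {"type": "text", "text": question},
            ],
        }
    ]
```

The resulting list can be passed directly as the `messages` field of a chat-completions request.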

Request Parameters (/v1/chat/completions)

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | Must be `"Qwen3.5-122B-A10B-FP8"`. |
| `messages` | array | Yes | Array of message objects. Supports text and image inputs. |
| `max_tokens` | integer | No | Maximum tokens to generate. Suggested range: 1 to 8,192. |
| `enable_thinking` | boolean | No | Default: `true`. Enables the `<think>` reasoning block. |
| `temperature` | float | No | Controls randomness. Suggested: 0.2 for logic, 0.7 for general chat. |
| `top_p` | float | No | Nucleus sampling threshold. Default: 0.95. |
| `stream` | boolean | No | Whether to stream tokens in real time via SSE. |
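The parameters above can be assembled into a request body as shown in this minimal standard-library sketch. The base URL and `YOUR_API_KEY` are placeholders, not real endpoint details; substitute your provider's values.

```python
import json
from urllib import request

API_BASE = "https://api.example.com"  # placeholder: your provider's base URL
API_KEY = "YOUR_API_KEY"              # placeholder credential

def build_chat_payload(messages, *, max_tokens=1024, enable_thinking=True,
                       temperature=0.7, top_p=0.95, stream=False):
    """Assemble a /v1/chat/completions request body from the documented parameters."""
    return {
        "model": "Qwen3.5-122B-A10B-FP8",
        "messages": messages,
        "max_tokens": max_tokens,
        "enable_thinking": enable_thinking,
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,
    }

def chat(messages, **params):
    """POST the payload and return the parsed JSON response (non-streaming)."""
    body = json.dumps(build_chat_payload(messages, **params)).encode("utf-8")
    req = request.Request(
        f"{API_BASE}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

For streaming responses, set `stream=True` and consume the SSE event stream instead of calling `json.load` on the whole body.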

Optimization Scenarios

| Scenario | Recommended Params | Purpose |
|----------|--------------------|---------|
| Deep Reasoning | `enable_thinking: true`, `temperature: 0.2` | Best for complex math, logic, and multi-step coding problems. |
| Multimodal Analysis | `messages` with image input, `temperature: 0.4` | Ideal for document OCR, chart reasoning, and visual QA. |
| Long Doc Synthesis | `max_tokens: 4096+`, `top_p: 0.9` | Leverages the 262K context for accurate long-form extraction. |
| Global Translation | `temperature: 0.3` | Uses the 201+-language support for nuanced, culturally aware translation. |
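These recommendations can be captured as reusable presets. The sketch below is illustrative only: the preset names are our own labels, not part of the API, and the merged dict is a plain `/v1/chat/completions` request body.

```python
# Parameter presets mirroring the optimization scenarios above.
# The scenario keys are hypothetical labels for this example, not API values.
SCENARIO_PRESETS = {
    "deep_reasoning": {"enable_thinking": True, "temperature": 0.2},
    "multimodal_analysis": {"temperature": 0.4},
    "long_doc_synthesis": {"max_tokens": 4096, "top_p": 0.9},
    "global_translation": {"temperature": 0.3},
}

def payload_for(scenario: str, messages: list) -> dict:
    """Merge a scenario preset into a base chat-completions request body."""
    payload = {"model": "Qwen3.5-122B-A10B-FP8", "messages": messages}
    payload.update(SCENARIO_PRESETS[scenario])
    return payload
```

For example, `payload_for("deep_reasoning", msgs)` yields a body with thinking mode on and a low temperature, matching the first row of the table.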
