
Qwen3.5 27B

A high-intelligence sparse Mixture-of-Experts (MoE) model optimized for advanced reasoning, complex instruction following, and precise tool use. With 27B parameters and fine-grained FP8 quantization, it offers a 262K-token native context window and built-in 'thinking mode' support, delivering elite-level logic and linguistic performance with exceptional inference efficiency.

Pricing: Input $0.300 / 1M tokens; Output $2.400 / 1M tokens


Qwen3.5-27B-FP8 API Documentation

Qwen3.5-27B-FP8 is a state-of-the-art sparse Mixture-of-Experts (MoE) multimodal foundation model released by Alibaba Cloud. This variant uses fine-grained FP8 (8-bit floating point) quantization, allowing the 27-billion-parameter model to achieve elite-level reasoning while significantly reducing memory footprint and increasing inference throughput relative to the BF16 variant.


Key Features

  • Efficient Hybrid Architecture: Combines Gated Delta Networks with Sparse MoE, activating only a fraction of its 27B parameters per token for high-speed inference.
  • Unified Vision-Language: Natively processes both text and high-resolution images within a single transformer architecture.
  • 262K Native Context: Supports 262,144 tokens natively, extensible up to 1.01 million tokens via RoPE scaling (e.g., YaRN).
  • Thinking Mode: Features an internal reasoning loop (Chain-of-Thought) to solve complex logic, coding, and mathematical problems.
  • Hardware Optimization: FP8 support enables high-performance serving on NVIDIA Hopper (H100), Blackwell, and consumer-grade Ada Lovelace (RTX 40-series) GPUs.
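When thinking mode is active, responses can contain the model's internal reasoning alongside the final answer. A minimal sketch of separating the two, assuming the reasoning is delimited by <think>…</think> tags (a common convention for this model family; the exact delimiter may vary by serving stack):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Separate a <think>...</think> reasoning block from the final answer.

    Assumes the serving stack emits reasoning inside <think> tags;
    adjust the pattern if your deployment uses a different delimiter.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()  # no reasoning block present
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

reasoning, answer = split_thinking("<think>2+2 is 4.</think>The answer is 4.")
# reasoning -> "2+2 is 4.", answer -> "The answer is 4."
```

Stripping the reasoning block before display keeps user-facing output clean while preserving the chain of thought for logging or debugging.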

Request Parameters

The endpoint /v1/chat/completions accepts the following parameters in a JSON-encoded body.

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Use "Qwen/Qwen3.5-27B-FP8". |
| messages | array | Yes | List of message objects. Supports text and image inputs (via URL or base64). |
| max_tokens | integer | No | Limits generated response length. Suggested max: 8,192. |
| temperature | float | No | 0.0 to 1.5. Use 0.6 for standard tasks; 1.0+ for creative reasoning. |
| stream | boolean | No | If true, tokens are delivered via Server-Sent Events (SSE). |
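The parameters above can be assembled into a request with nothing but the standard library. A minimal sketch (the Bearer-token header and base URL follow common OpenAI-compatible conventions and are assumptions; verify both against your provider):

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, user_text: str) -> urllib.request.Request:
    """Assemble a POST request to /v1/chat/completions for Qwen/Qwen3.5-27B-FP8.

    The "Authorization: Bearer ..." header is a common OpenAI-compatible
    convention, not confirmed by this documentation; adapt as needed.
    """
    payload = {
        "model": "Qwen/Qwen3.5-27B-FP8",
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 1024,   # keep well under the suggested 8,192 cap
        "temperature": 0.6,   # recommended value for standard tasks
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("https://api.example.com", "YOUR_API_KEY", "Hello!")
# urllib.request.urlopen(req) would send it; omitted here to stay offline.
```

For image inputs, the same payload would carry a content array mixing text parts and image parts (URL or base64), per the messages parameter above.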

Advanced Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_thinking | boolean | true | Enables the <think> reasoning block. Recommended for complex tasks. |
| top_p | float | 0.95 | Nucleus sampling: restricts sampling to the smallest token set whose cumulative probability reaches 0.95. |
| presence_penalty | float | 0.0 | Range: -2.0 to 2.0. Penalizes repeated topics. |
| stop | array | null | Up to 4 sequences at which the API stops generating tokens. |
| language-model-only | boolean | false | (Serving-level) Disables the vision encoder to save VRAM for KV cache. |
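When stream is set to true (see Request Parameters above), tokens arrive as Server-Sent Events. A sketch of extracting text from the `data:` lines, assuming OpenAI-style chunk objects with `choices[0].delta.content` and a `data: [DONE]` sentinel (both common conventions, not guaranteed by every serving stack):

```python
import json
from typing import Iterable, Iterator

def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from SSE lines of the form 'data: {...}'.

    Assumes OpenAI-style streaming chunks; field names and the
    '[DONE]' sentinel may differ on other deployments.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comments
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(body)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_text(sample))
# text -> "Hello"
```

In production the lines would come from iterating over the HTTP response body rather than a hard-coded list.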

Model Specifications Summary

| Metric | Specification |
|---|---|
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Quantization | FP8 (fine-grained, block size 128) |
| Context Length | 262,144 tokens (native) |
| Modalities | Text, Vision (image) |
| Language Support | 201+ languages |
