
A state-of-the-art Large Language Model optimized for high-concurrency deployment. This version features the cutting-edge NVFP4 (NVIDIA FP4) quantization, specifically engineered for Blackwell and late-generation Hopper architectures. It delivers maximum token-per-second throughput while maintaining 230B-class intelligence, excelling in multi-turn dialogue consistency and complex instruction following.
MiniMax-M2.5-NVFP4 is the latest high-performance large language model from MiniMax, leveraging NVIDIA's FP4 (NVFP4) quantization technology. By utilizing 4-bit floating-point precision, this model achieves a breakthrough in inference speed and memory efficiency without compromising its sophisticated reasoning and long-context capabilities. It is particularly well-suited for low-latency real-time applications and massive document analysis.
Billed per 1K tokens (Input + Output). The NVFP4 architecture allows for high-throughput processing at a competitive price point. Please refer to the console pricing page for the latest rates.
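Since billing is per 1K tokens (input + output), cost can be estimated with simple arithmetic. The sketch below uses a placeholder rate — actual rates are on the console pricing page, not assumed here:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  rate_per_1k: float) -> float:
    """Billing is per 1K tokens, counting input and output together."""
    total_tokens = input_tokens + output_tokens
    return (total_tokens / 1000) * rate_per_1k

# Example: 12,000 input + 3,000 output tokens at a hypothetical $0.002 / 1K
cost = estimate_cost(12_000, 3_000, 0.002)
print(f"${cost:.4f}")
```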
Send a `POST` request to `/v1/chat/completions` with the header `Authorization: Bearer <API_KEY>`, setting `model` to `MiniMax-M2.5-NVFP4`.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Must be `MiniMax-M2.5-NVFP4`. |
| `messages` | array | Yes | Array of message objects (`role` & `content`). Supports long context. |
| `max_tokens` | integer | No | Maximum tokens to generate. The model supports a context of up to 128,000 tokens. |
| `temperature` | float | No | Controls randomness. Recommended: 0.3 for logic, 0.7 for general chat. |
| `top_p` | float | No | Nucleus sampling threshold. Default is 0.95. |
| `stream` | boolean | No | Enable real-time token streaming. Highly recommended for M2.5. |
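The parameters above can be combined into a request body as follows. This is a minimal sketch: `BASE_URL` and `API_KEY` are placeholders to be replaced with the values from your console.

```python
import json

BASE_URL = "https://api.example.com"  # placeholder -- use your console's endpoint
API_KEY = "<YOUR_API_KEY>"            # placeholder

payload = {
    "model": "MiniMax-M2.5-NVFP4",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize NVFP4 quantization in one sentence."},
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95,
    "stream": False,
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
# To send, e.g. with the requests library:
# requests.post(f"{BASE_URL}/v1/chat/completions", headers=headers, data=json.dumps(payload))
print(json.dumps(payload, indent=2))
```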
The M2.5-NVFP4 is built for speed and long-context handling. Use these settings to maximize efficiency:
| Scenario | Recommended Params | Purpose |
|---|---|---|
| Real-time Chat | `stream: true`, `temperature: 0.7` | Provides the fastest "typing" experience for end users. |
| Long Document Analysis | `max_tokens: 4000+`, `temperature: 0.2` | Leverages NVFP4 efficiency for stable, long-form synthesis. |
| Structured Output (JSON) | `temperature: 0.1` | Encourages strict adherence to formatting rules and reduces hallucination. |
| High Throughput | `top_p: 0.9` | Narrows token selection for faster batch processing. |
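With `stream: true`, responses arrive incrementally. The parser below assumes the common OpenAI-style Server-Sent Events framing (`data: {...}` lines, terminated by `data: [DONE]`); the exact wire format for this API is an assumption, so check the official docs before relying on it:

```python
import json

def parse_sse_line(line: str):
    """Extract the incremental text from one 'data: {...}' SSE line.

    Returns None for non-data lines and for the '[DONE]' terminator.
    Assumes each chunk carries choices[0].delta.content, as in
    OpenAI-compatible streaming APIs (an assumption, not confirmed here).
    """
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data.strip() == "[DONE]":
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content", "")

# Simulated stream fragments, as they might arrive over the wire:
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(p for line in stream if (p := parse_sse_line(line)) is not None)
print(text)  # → Hello
```

Streaming does not change billing — the same per-1K-token rate applies — but it lets the UI render tokens as they are generated instead of waiting for the full completion.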