MiniMax-M2.5-NVFP4

A state-of-the-art Large Language Model optimized for high-concurrency deployment. This version features the cutting-edge NVFP4 (NVIDIA FP4) quantization, specifically engineered for Blackwell and late-generation Hopper architectures. It delivers maximum token-per-second throughput while maintaining 230B-class intelligence, excelling in multi-turn dialogue consistency and complex instruction following.

Pricing: $0.30 per 1M input tokens; $1.20 per 1M output tokens.

MiniMax-M2.5-NVFP4 is the latest high-performance large language model from MiniMax, leveraging NVIDIA's FP4 (NVFP4) quantization technology. By utilizing 4-bit floating-point precision, this model achieves a breakthrough in inference speed and memory efficiency without compromising its sophisticated reasoning and long-context capabilities. It is particularly well-suited for low-latency real-time applications and massive document analysis.

Billing

Billed per 1K tokens (input + output). NVFP4 quantization enables high-throughput serving at a competitive price point; refer to the console pricing page for the latest rates.
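As a back-of-the-envelope helper, the per-request cost under the rates listed above can be computed like this (a sketch only; actual metering granularity and rates are governed by the console pricing page):

```python
# Rates from the pricing line above: $0.30 per 1M input tokens,
# $1.20 per 1M output tokens.
PRICE_IN_PER_M = 0.30
PRICE_OUT_PER_M = 1.20

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate USD cost for one request."""
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# e.g. a 10K-token prompt with a 2K-token reply:
# 10_000 * 0.30/1M + 2_000 * 1.20/1M = 0.003 + 0.0024 = 0.0054 USD
print(f"${estimate_cost(10_000, 2_000):.4f}")
```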

Quick start

  1. Get an API key (Console → API Keys).
  2. Call POST /v1/chat/completions with the header Authorization: Bearer <YOUR_API_KEY>.
  3. Set model to "MiniMax-M2.5-NVFP4".
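The steps above can be sketched as follows. The base URL here is a hypothetical placeholder, and the exact response shape is not specified in this document; check the console docs for your account's endpoint.

```python
# Sketch of quick-start steps 2-3. API_URL is an assumption,
# not the official endpoint.
API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical host

def build_request(api_key: str, user_message: str) -> tuple[dict, dict]:
    """Assemble the headers and JSON body for a chat completion call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "MiniMax-M2.5-NVFP4",
        "messages": [{"role": "user", "content": user_message}],
    }
    return headers, payload

headers, payload = build_request("YOUR_API_KEY", "Hello!")
# To send: requests.post(API_URL, headers=headers, json=payload)
print(payload["model"])  # MiniMax-M2.5-NVFP4
```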

Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Must be MiniMax-M2.5-NVFP4. |
| messages | array | Yes | Array of message objects (role & content). Supports long context. |
| max_tokens | integer | No | Maximum number of tokens to generate. The model supports a context window of up to 128,000 tokens. |
| temperature | float | No | Controls randomness. Recommended: 0.3 for logic-heavy tasks, 0.7 for general chat. |
| top_p | float | No | Nucleus sampling threshold. Default: 0.95. |
| stream | boolean | No | Enables real-time token streaming. Recommended for interactive use. |

Recommended Settings

M2.5-NVFP4 is built for speed and long-context handling. Use these presets to maximize efficiency:

| Scenario | Recommended parameters | Purpose |
|---|---|---|
| Real-time chat | stream: true, temperature: 0.7 | Fastest "typing" experience for end users. |
| Long document analysis | max_tokens: 4000+, temperature: 0.2 | Leverages NVFP4 efficiency for stable, long-form synthesis. |
| Structured output (JSON) | temperature: 0.1 | Helps the model follow strict formatting rules reliably. |
| High throughput | top_p: 0.9 | Narrows token selection for faster batch processing. |
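The presets in the table above can be expressed as request fragments that merge into the JSON body (the scenario names used as dictionary keys here are illustrative, not part of the API):

```python
# Parameter presets from the recommendations table, keyed by
# illustrative scenario names.
PRESETS = {
    "realtime_chat": {"stream": True, "temperature": 0.7},
    "long_document": {"max_tokens": 4000, "temperature": 0.2},
    "structured_json": {"temperature": 0.1},
    "high_throughput": {"top_p": 0.9},
}

def apply_preset(payload: dict, scenario: str) -> dict:
    """Return a copy of `payload` with the scenario's settings merged in."""
    return {**payload, **PRESETS[scenario]}

base = {"model": "MiniMax-M2.5-NVFP4", "messages": []}
print(apply_preset(base, "structured_json")["temperature"])  # 0.1
```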

Parameters summary

  • model (required): Must be "MiniMax-M2.5-NVFP4".
  • messages (required): Array of objects representing the conversation.
  • max_tokens (optional): Maximum length of the generated response; the model supports a context window of up to 128K tokens.
  • temperature (optional): Controls randomness. Default: 0.7.
  • top_p (optional): Nucleus sampling threshold.
  • stream (optional): Whether to stream tokens in real-time.
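When stream is enabled, responses typically arrive as server-sent events. The wire format below (SSE lines of the form data: {...}, terminated by data: [DONE]) is an assumption based on common chat-completions APIs; verify it against the official streaming docs before relying on it.

```python
import json

def extract_deltas(sse_lines):
    """Yield content fragments from a stream of SSE data lines.

    Assumes OpenAI-style chunks: {"choices": [{"delta": {"content": ...}}]}.
    """
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alives, etc.
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        chunk = json.loads(body)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(sample)))  # Hello
```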
