MiniMax-M2.5-NVFP4

A state-of-the-art Large Language Model optimized for high-concurrency deployment. This version features the cutting-edge NVFP4 (NVIDIA FP4) quantization, specifically engineered for Blackwell and late-generation Hopper architectures. It delivers maximum token-per-second throughput while maintaining 230B-class intelligence, excelling in multi-turn dialogue consistency and complex instruction following.

Pricing: $0.30 per 1M input tokens; $1.20 per 1M output tokens.

MiniMax-M2.5-NVFP4 is the latest high-performance large language model from MiniMax, leveraging NVIDIA's FP4 (NVFP4) quantization technology. By utilizing 4-bit floating-point precision, this model achieves a breakthrough in inference speed and memory efficiency without compromising its sophisticated reasoning and long-context capabilities. It is particularly well-suited for low-latency real-time applications and massive document analysis.

Billing

Billed per 1K tokens (input + output). NVFP4 quantization enables high-throughput serving at a competitive price point; refer to the console pricing page for the latest rates.
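As a back-of-the-envelope helper, the per-request cost under the rates listed above can be computed like this (a sketch only; actual metering granularity and rates are governed by the console pricing page):

```python
# Rates from the pricing line above: $0.30 per 1M input tokens,
# $1.20 per 1M output tokens.
PRICE_IN_PER_M = 0.30
PRICE_OUT_PER_M = 1.20

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate USD cost for one request."""
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# e.g. a 10K-token prompt with a 2K-token reply:
# 10_000 * 0.30/1M + 2_000 * 1.20/1M = 0.003 + 0.0024 = 0.0054 USD
print(f"${estimate_cost(10_000, 2_000):.4f}")
```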

Quick start

  1. Get an API key (Console → API Keys).
  2. Call POST /v1/chat/completions with the header Authorization: Bearer <YOUR_API_KEY>.
  3. Set model to "MiniMax-M2.5-NVFP4".
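The steps above can be sketched as follows. The base URL here is a hypothetical placeholder, and the exact response shape is not specified in this document; check the console docs for your account's endpoint.

```python
# Sketch of quick-start steps 2-3. API_URL is an assumption,
# not the official endpoint.
API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical host

def build_request(api_key: str, user_message: str) -> tuple[dict, dict]:
    """Assemble the headers and JSON body for a chat completion call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "MiniMax-M2.5-NVFP4",
        "messages": [{"role": "user", "content": user_message}],
    }
    return headers, payload

headers, payload = build_request("YOUR_API_KEY", "Hello!")
# To send: requests.post(API_URL, headers=headers, json=payload)
print(payload["model"])  # MiniMax-M2.5-NVFP4
```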

Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Must be MiniMax-M2.5-NVFP4. |
| messages | array | Yes | Array of message objects (role & content). Supports long context. |
| max_tokens | integer | No | Maximum number of tokens to generate. The model supports a context window of up to 128,000 tokens. |
| temperature | float | No | Controls randomness. Recommended: 0.3 for logic-heavy tasks, 0.7 for general chat. |
| top_p | float | No | Nucleus sampling threshold. Default: 0.95. |
| stream | boolean | No | Enables real-time token streaming. Recommended for interactive use. |

Recommended Settings

M2.5-NVFP4 is built for speed and long-context handling. Use these presets to maximize efficiency:

| Scenario | Recommended parameters | Purpose |
|---|---|---|
| Real-time chat | stream: true, temperature: 0.7 | Fastest "typing" experience for end users. |
| Long document analysis | max_tokens: 4000+, temperature: 0.2 | Leverages NVFP4 efficiency for stable, long-form synthesis. |
| Structured output (JSON) | temperature: 0.1 | Helps the model follow strict formatting rules reliably. |
| High throughput | top_p: 0.9 | Narrows token selection for faster batch processing. |
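The presets in the table above can be expressed as request fragments that merge into the JSON body (the scenario names used as dictionary keys here are illustrative, not part of the API):

```python
# Parameter presets from the recommendations table, keyed by
# illustrative scenario names.
PRESETS = {
    "realtime_chat": {"stream": True, "temperature": 0.7},
    "long_document": {"max_tokens": 4000, "temperature": 0.2},
    "structured_json": {"temperature": 0.1},
    "high_throughput": {"top_p": 0.9},
}

def apply_preset(payload: dict, scenario: str) -> dict:
    """Return a copy of `payload` with the scenario's settings merged in."""
    return {**payload, **PRESETS[scenario]}

base = {"model": "MiniMax-M2.5-NVFP4", "messages": []}
print(apply_preset(base, "structured_json")["temperature"])  # 0.1
```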

Parameters summary

  • model (required): Must be "MiniMax-M2.5-NVFP4".
  • messages (required): Array of objects representing the conversation.
  • max_tokens (optional): Maximum length of the generated response; the model supports a context window of up to 128K tokens.
  • temperature (optional): Controls randomness. Default: 0.7.
  • top_p (optional): Nucleus sampling threshold.
  • stream (optional): Whether to stream tokens in real-time.
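When stream is enabled, responses typically arrive as server-sent events. The wire format below (SSE lines of the form data: {...}, terminated by data: [DONE]) is an assumption based on common chat-completions APIs; verify it against the official streaming docs before relying on it.

```python
import json

def extract_deltas(sse_lines):
    """Yield content fragments from a stream of SSE data lines.

    Assumes OpenAI-style chunks: {"choices": [{"delta": {"content": ...}}]}.
    """
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alives, etc.
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        chunk = json.loads(body)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta

sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(sample)))  # Hello
```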
