Higgs Audio V2.5 is a 1B parameter autoregressive audio transformer distilled from the 3B V2 model, featuring the DualFFN architecture for efficient acoustic token modeling. It uses a unified audio tokenizer running at 25 FPS with 12 codebooks at 2000 bps, outputting 24kHz audio. Trained on 1M+ hours of audio data (AudioWeb dataset) with GRPO alignment for human-like naturalness.
Template
Generate speech to see the audio player.
Higgs Audio V2.5 is a 1B parameter autoregressive audio transformer distilled from the 3B V2 model. Featuring the DualFFN architecture and trained on 1M+ hours of AudioWeb data with GRPO alignment, it delivers human-like naturalness and low-latency speech synthesis.
This endpoint follows the OpenAI-compatible /v1/chat/completions schema, outputting high-fidelity 24kHz audio via a unified 25 FPS tokenizer.
$0.045 / audio minute
Billed based on the total duration of generated audio.
| Parameter | Type | Required | Description |
|---|---|---|---|
model |
string |
Yes | Must be "boson/higgs-audio-v2.5". |
messages |
array |
Yes | Chat-formatted input. The model generates audio based on the last user message. |
modalities |
array |
Yes | Must include ["text", "audio"] to enable TTS mode. |
audio |
object |
No | Specifies audio details such as voice (e.g., alloy, echo) and format (wav, mp3). |
temperature |
float |
No | Controls randomness in prosody and naturalness. Recommended: 1.0. |
top_p |
float |
No | Nucleus sampling threshold. Recommended: 0.95 for optimal acoustic stability. |
max_completion_tokens |
integer |
No | Limits the total tokens generated (indirectly limits audio duration). |
extra_body |
object |
No | Model-specific parameters, such as {"top_k": 50} for decoding efficiency. |
stop |
array |
No | Stop sequences, e.g., `["< |
| Field | Type | Description |
|---|---|---|
id |
string |
Unique identifier for the request. |
object |
string |
Always "chat.completion". |
created |
integer |
Unix timestamp of the request. |
model |
string |
The exact model version executed. |
choices |
array |
List of generated outputs (typically contains one item). |
usage |
object |
Statistics including prompt_tokens and completion_tokens. |
| Field | Type | Description |
|---|---|---|
data |
string |
Base64-encoded audio data. Default sampling rate is 24kHz. |
id |
string |
Unique ID for the audio resource. |
expires_at |
integer |
Expiration timestamp for the audio data (if cached). |
transcript |
string |
The normalized text content corresponding to the generated audio. |
Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.
