
Higgs-Audio-v3-Speech-to-Text is a high-performance automatic speech recognition (ASR) model developed by BosonAI. Built on a 1.7B parameter architecture, it delivers accurate transcription across 60+ languages with an OpenAI Whisper-compatible API interface.
Higgs-Audio v3 is a state-of-the-art Speech-to-Text (STT) foundation model developed by Boson AI. Unlike traditional ASR models, Higgs-Audio v3 is a "reasoning-capable" transcriber. It utilizes a 10B parameter architecture (optimized for inference) to not only convert speech into text but also to capture the nuances of speaker intent, emotion, and background context.
To use the Higgs-Audio v3 STT model, send a POST request to the /v1/audio/transcriptions endpoint with the following parameters.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `file` | binary | Yes | The audio file to transcribe (mp3, wav, flac, m4a). Max 25 MB. |
| `model` | string | Yes | Use `"bosonai-higgs-audio-v3-stt"`. |
| `language` | string | No | ISO 639-1 code (e.g., `en`, `zh`). If omitted, the language is auto-detected. |
| `response_format` | string | No | `json`, `text`, `srt`, `vtt`, or `verbose_json`. |
| `prompt` | string | No | Optional text to guide the model's style or terminology. |
The following optional parameters enable advanced features:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `timestamp_granularities[]` | array | `["segment"]` | Options: `word`, `segment`. Enables precise timing for each part. |
| `temperature` | float | `0.0` | Controls sampling randomness. `0.0` is recommended for high accuracy. |
| `diarization` | boolean | `false` | When `true`, the model labels different speakers (e.g., Speaker 1, Speaker 2). |
| `sentiment_analysis` | boolean | `false` | When `true`, includes speaker emotion annotations in the output. |
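The advanced options above can be layered onto the base form data before sending. This is an illustrative sketch (the helper name and string encoding of booleans are assumptions; check how your HTTP client serializes multipart fields):

```python
# Hypothetical helper: merge advanced options into the base form data
# for POST /v1/audio/transcriptions.
def with_advanced_options(data: dict,
                          word_timestamps: bool = False,
                          diarization: bool = False,
                          sentiment: bool = False,
                          temperature: float = 0.0) -> dict:
    out = dict(data)
    out["temperature"] = str(temperature)  # 0.0 recommended for accuracy
    # "segment" is the default granularity; add "word" for per-word timing.
    granularities = ["segment"] + (["word"] if word_timestamps else [])
    out["timestamp_granularities[]"] = granularities
    if diarization:
        out["diarization"] = "true"   # labels Speaker 1, Speaker 2, ...
    if sentiment:
        out["sentiment_analysis"] = "true"  # adds emotion annotations
    return out
```

Only the options you enable are included, so the request stays compatible with clients that ignore unknown fields.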