Higgs-Audio-v3-Speech-to-Text

Higgs-Audio-v3-Speech-to-Text is an automatic speech recognition (ASR) model developed by Boson AI. Built on a 1.7B-parameter architecture, it transcribes speech in 60+ languages and exposes an API compatible with OpenAI Whisper.

$0.006 per minute

BosonAI Higgs-Audio-V3-STT

Higgs-Audio v3 is a Speech-to-Text (STT) foundation model developed by Boson AI. Unlike traditional ASR models, Higgs-Audio v3 is described as a "reasoning-capable" transcriber: it uses a 10B-parameter architecture (optimized for inference) to convert speech into text while also capturing speaker intent, emotion, and background context.

Key Capabilities

  • Massive Multilingual Support: Superior accuracy across 94 languages with automatic language identification.
  • Context-Aware Transcription: Achieves a 20%+ lower word error rate (WER) than Whisper-v3-Large in complex, multi-speaker environments.
  • Semantic Understanding: Can output structured summaries or sentiment analysis directly from the audio stream.
  • Diarization & Timestamps: Built-in ability to distinguish between different speakers and provide word-level timestamps.
  • Dual-FFN Architecture: Inherits the high-efficiency design from the Higgs series, allowing for faster-than-real-time processing on standard GPUs.

Request Parameters

To use the Higgs-Audio v3 STT model, send a POST request to the /v1/audio/transcriptions endpoint with the following parameters.

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | The audio file to transcribe (mp3, wav, flac, m4a). Max 25MB. |
| model | string | Yes | Use "bosonai-higgs-audio-v3-stt". |
| language | string | No | ISO 639-1 code (e.g., en, zh). If omitted, auto-detection is used. |
| response_format | string | No | json, text, srt, vtt, or verbose_json. |
| prompt | string | No | Optional text to guide the model's style or terminology. |
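
A minimal Python sketch of such a request is shown below. It checks the file against the documented formats and size limit before uploading. The host in API_BASE, the Bearer-token header, and the helper names (check_audio, build_fields, transcribe) are assumptions for illustration and are not taken from this page; substitute the values your provider documents.

```python
import os
from pathlib import Path

API_BASE = "https://api.example.com"  # hypothetical host, replace with your provider's base URL
MODEL = "bosonai-higgs-audio-v3-stt"
ALLOWED_SUFFIXES = {".mp3", ".wav", ".flac", ".m4a"}
MAX_BYTES = 25 * 1024 * 1024  # assumes "25MB" means 25 MiB


def check_audio(path, size_bytes):
    """Return a list of problems with the file, based on the documented limits."""
    problems = []
    if Path(path).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("unsupported format")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 25MB")
    return problems


def build_fields(language=None, response_format=None, prompt=None):
    """Build the form fields; optional parameters are left out when not set."""
    fields = {"model": MODEL}
    if language is not None:
        fields["language"] = language
    if response_format is not None:
        fields["response_format"] = response_format
    if prompt is not None:
        fields["prompt"] = prompt
    return fields


def transcribe(path, api_key, **options):
    """Upload the file as multipart/form-data and return the HTTP response."""
    import requests  # third-party dependency: pip install requests

    problems = check_audio(path, os.path.getsize(path))
    if problems:
        raise ValueError(", ".join(problems))
    with open(path, "rb") as audio:
        response = requests.post(
            f"{API_BASE}/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
            data=build_fields(**options),
            files={"file": (Path(path).name, audio)},
            timeout=120,
        )
    response.raise_for_status()
    return response
```

A call such as transcribe("meeting.mp3", api_key, language="en", response_format="text") would then send the file with an explicit language; leaving language out relies on auto-detection.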

Optional Parameters (Advanced)

| Parameter | Type | Default | Description |
|---|---|---|---|
| timestamp_granularities[] | array | ["segment"] | Options: word, segment. Enables precise timing for each part. |
| temperature | float | 0.0 | Controls sampling randomness. 0.0 is recommended for high accuracy. |
| diarization | boolean | false | When true, the model labels different speakers (e.g., Speaker 1, Speaker 2). |
| sentiment_analysis | boolean | false | If enabled, includes speaker emotion |
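
The advanced parameters can be encoded as extra form fields, as in the sketch below. It assumes that array values are sent by repeating the timestamp_granularities[] key and that booleans are sent as the strings "true" and "false"; neither encoding is confirmed on this page, and build_advanced_fields is a hypothetical helper name.

```python
ALLOWED_GRANULARITIES = {"word", "segment"}


def build_advanced_fields(granularities=("segment",), temperature=0.0,
                          diarization=False, sentiment_analysis=False):
    """Return the advanced parameters as a list of (name, value) form fields."""
    unknown = set(granularities) - ALLOWED_GRANULARITIES
    if unknown:
        raise ValueError(f"unsupported granularities: {sorted(unknown)}")

    # Assumption: an array parameter is sent as one repeated key per value.
    fields = [("timestamp_granularities[]", g) for g in granularities]
    fields.append(("temperature", str(temperature)))
    # Assumption: booleans are serialized as lowercase "true" / "false".
    fields.append(("diarization", str(diarization).lower()))
    fields.append(("sentiment_analysis", str(sentiment_analysis).lower()))
    return fields


if __name__ == "__main__":
    for name, value in build_advanced_fields(("word", "segment"), diarization=True):
        print(f"{name}={value}")
```

The returned list can be appended to the required form fields of a multipart request; a list of pairs is used rather than a dictionary so that the repeated array key survives encoding.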
