No matter where you start, build and scale your AI with ByteCompute.
All the categories and models you can try out and seamlessly integrate into your projects.

A fine-tuned version of the Whisper large-v3 model designed for near real-time Automatic Speech Recognition (ASR) and speech translation. By reducing the number of decoder layers while maintaining the robust encoder architecture, this "Turbo" variant offers a significant speedup (up to 8x faster) with minimal degradation in Word Error Rate (WER). Ideal for low-latency production deployments.
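A rough back-of-envelope sketch of where that speedup comes from, assuming autoregressive decode time scales roughly linearly with decoder depth and using the publicly reported layer counts for Whisper large-v3 versus the Turbo variant:

```python
# Whisper large-v3 decodes with 32 decoder layers; the Turbo variant keeps
# the full encoder but trims the decoder down to 4 layers.
LARGE_V3_DECODER_LAYERS = 32
TURBO_DECODER_LAYERS = 4

# If decode cost scales roughly linearly with decoder depth, the theoretical
# decode speedup is simply the ratio of layer counts.
speedup = LARGE_V3_DECODER_LAYERS / TURBO_DECODER_LAYERS
print(f"theoretical decode speedup: ~{speedup:.0f}x")  # ~8x, matching the headline figure
```

In practice the encoder and token sampling still take time, so real-world gains depend on audio length and batch size.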

The latest flagship MoE (Mixture-of-Experts) model from the Qwen team. With a total of 122B parameters and 10B active parameters per token, it strikes an elite balance between reasoning throughput and model capacity. This build utilizes FP8 quantization, significantly reducing VRAM requirements and leveraging hardware acceleration on modern GPUs (H100/L40S) for high-performance inference in complex logic and coding tasks.
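A quick sketch of the numbers behind that throughput/capacity balance, using only the figures quoted above (KV cache and activation memory excluded):

```python
TOTAL_PARAMS = 122e9   # total MoE parameters
ACTIVE_PARAMS = 10e9   # parameters activated per token

# FP8 stores one byte per weight, so resident weight memory in GB is
# approximately the total parameter count in billions.
fp8_weight_gb = TOTAL_PARAMS * 1 / 1e9

# Per-token compute is driven by the active expert subset, not the full model.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"FP8 weight footprint: ~{fp8_weight_gb:.0f} GB")
print(f"active parameters per token: {active_fraction:.1%} of total")
```

This is why MoE models can carry large-model capacity while decoding at close to the speed of a much smaller dense model.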

A state-of-the-art Large Language Model optimized for high-concurrency deployment. This version features the cutting-edge NVFP4 (NVIDIA FP4) quantization, specifically engineered for Blackwell and late-generation Hopper architectures. It delivers maximum token-per-second throughput while maintaining 230B-class intelligence, excelling in multi-turn dialogue consistency and complex instruction following.
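As a rough illustration of what NVFP4 buys in memory terms: per NVIDIA's published description of the format, weights are stored as 4-bit FP4 (E2M1) values with each small block of 16 weights sharing an FP8 scale factor, so the effective storage cost works out as sketched below (per-tensor scale overhead ignored):

```python
# NVFP4: 4-bit FP4 values, one FP8 scale shared per block of 16 weights
# (as described in NVIDIA's format documentation; overheads simplified).
FP4_BITS = 4
BLOCK_SIZE = 16
SCALE_BITS = 8

bits_per_weight = FP4_BITS + SCALE_BITS / BLOCK_SIZE  # effective bits per weight
compression_vs_fp16 = 16 / bits_per_weight

print(f"effective bits per weight: {bits_per_weight}")        # 4.5
print(f"compression vs FP16: ~{compression_vs_fp16:.2f}x")    # ~3.56x
```

The fine-grained block scaling is what lets the format stay this compact while preserving accuracy on Blackwell-class hardware with native FP4 support.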

A high-performance, dense Transformer model from the Qwen3 series featuring 32 billion parameters. This version is optimized with FP8 quantization, allowing it to fit within a ~32GB VRAM footprint while maintaining near-lossless perplexity. It serves as an ideal "workhorse" model for developers needing a balance between high-level reasoning and fast inference speeds for enterprise-grade chat and logic applications.
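The ~32GB figure follows directly from the quantization: FP8 stores one byte per weight, so a 32B-parameter dense model needs roughly 32 GB just for weights (KV cache and activations come on top), as this minimal sketch shows:

```python
PARAMS = 32e9          # dense 32B-parameter model
BYTES_PER_WEIGHT = 1   # FP8 = 1 byte per weight

weight_gb = PARAMS * BYTES_PER_WEIGHT / 1e9
print(f"FP8 weight memory: ~{weight_gb:.0f} GB")  # ~32 GB before KV cache/activations
```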

A state-of-the-art Diffusion Transformer (DiT) foundation model with 22 billion parameters. Unlike traditional video models, LTX-2 is natively designed for synchronized audio-video generation within a single unified latent space. It excels at maintaining temporal consistency and high-fidelity motion, making it a powerful backend for creative AI pipelines that require seamless audiovisual coherence.

A lean, high-intelligence dense model optimized for complex instruction following and precise API calling. With 27B parameters compressed via FP8 quantization, it offers a superior logic-to-memory ratio, enabling elite-level performance on consumer-grade or mid-tier enterprise GPUs.
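Since precise API calling is the headline capability, it may help to see the kind of tool definition such a model is typically prompted with. The sketch below uses the widely adopted OpenAI-style function-calling schema; the `get_weather` function and its fields are hypothetical, purely for illustration:

```python
import json

# Hypothetical tool definition in the common OpenAI-style function-calling schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Fetch the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# A model that is strong at API calling returns a structured call that parses
# cleanly and references the declared function by name.
model_reply = '{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}'
call = json.loads(model_reply)
assert call["name"] == get_weather_tool["function"]["name"]
print(call["arguments"])  # {'city': 'Berlin', 'unit': 'celsius'}
```

Reliably emitting valid, schema-conforming JSON like this is what "precise API calling" means in practice.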

A massive-scale Vision-Language (VL) MoE model designed for complex multimodal instruction following. Featuring a total of 235B parameters with 22B active parameters per token, it delivers top-tier performance in image understanding, document parsing, and visual reasoning. Quantized via AWQ (Activation-aware Weight Quantization), it is optimized for 4-bit weight compression to enable large-scale multimodal deployment with high throughput.
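A short sketch of what 4-bit AWQ compression means for a 235B-parameter model, using only the figures quoted above (quantization scale/zero-point overhead ignored):

```python
TOTAL_PARAMS = 235e9
FP16_BYTES = 2   # 2 bytes per weight at half precision
AWQ_BITS = 4     # 4-bit weight-only quantization

fp16_gb = TOTAL_PARAMS * FP16_BYTES / 1e9
awq_gb = TOTAL_PARAMS * AWQ_BITS / 8 / 1e9

print(f"FP16 weights: ~{fp16_gb:.0f} GB")       # ~470 GB
print(f"AWQ 4-bit weights: ~{awq_gb:.1f} GB")   # ~117.5 GB
```

That roughly 4x reduction in weight memory is what makes serving a model of this scale practical on a single multi-GPU node rather than a cluster.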
Contact our sales team to discuss your enterprise needs and deployment options.
Get Started