Documentation

Batch Inference

Learn how to use the Batch API to send asynchronous groups of requests with 50% lower costs, higher rate limits, and flexible completion windows. The service is ideal for processing jobs that don't require immediate responses.

Overview

The Batch API enables you to process large volumes of requests asynchronously at 50% lower cost compared to real-time API calls. It's perfect for workloads that don't need immediate responses, such as:

  • Running evaluations and data analysis
  • Classifying large datasets
  • Offline summarization
  • Synthetic data generation
  • Content generation for marketing
  • Dataset processing and transformations

Compared to using standard endpoints directly, Batch API offers:

  • Better cost efficiency: 50% cost discount compared to synchronous APIs
  • Higher rate limits: Substantially more headroom with separate rate limit pools
  • Large-scale support: Process thousands of requests per batch
  • Flexible completion: Best-effort completion within 24 hours with progress tracking

Getting started

Note: Make sure your bytecompute version number is >1.5.13. Run pip install bytecompute --upgrade to upgrade if needed.

1. Prepare your batch file

Batches start with a .jsonl file where each line contains the details of an individual request to the API. The available endpoint is /v1/chat/completions (Chat Completions API). Each request must include a unique custom_id value, which you can use to reference results after completion. Here's an example of an input file with 2 requests:

JSON
{"custom_id": "request-1", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 200}}
{"custom_id": "request-2", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Explain quantum computing"}], "max_tokens": 200}}

Each line in your batch file must follow this schema:

Field Type Required Description
custom_id string Yes Unique identifier for tracking (max 64 chars)
body object Yes The request body matching the endpoint's schema
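
If you are generating the input file programmatically, a minimal sketch along the following lines can help. The prompts list and filename are illustrative; only the custom_id and body fields from the schema above are assumed.

Python
import json

## Illustrative prompts; replace with your own data
prompts = ["Hello, world!", "Explain quantum computing"]

with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i + 1}",  # must be unique, max 64 chars
            "body": {
                "model": "deepseek-ai/DeepSeek-V3",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,
            },
        }
        f.write(json.dumps(request) + "\n")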

2. Upload your batch input file

You must first upload your input file so that you can reference it correctly when creating batches. Upload your .jsonl file using the Files API with purpose=batch-api.

Upload files for Batch API

Python
from bytecompute import bytecompute

client = bytecompute()

## Uploads batch job file
file_resp = client.files.upload(file="batch_input.jsonl", purpose="batch-api")

SH
bytecompute files upload batch_input.jsonl --purpose "batch-api"

This will return a file object with id and other details:

Python
FileResponse(
  id='file-fa37fdce-89cb-414b-923c-2add62250155', 
  object=<ObjectType.File: 'file'>, 
  ...
  filename='batch_input.jsonl', 
  bytes=1268723, 
  line_count=0, 
  processed=True, 
  FileType='jsonl')

3. Create the batch

Once you've successfully uploaded your input file, you can use the File object's ID to create a batch. For now, the completion window is fixed at 24h and cannot be changed. You can also provide custom metadata.

Create the Batch

Python
file_id = file_resp.id

batch = client.batches.create_batch(file_id, endpoint="/v1/chat/completions")

print(batch.id)

This request will return a Batch object with metadata about your batch:

JSON
{
  "id": "batch-xyz789",
  "status": "VALIDATING",
  "endpoint": "/v1/chat/completions",
  "input_file_id": "file-abc123",
  "created_at": "2024-01-15T10:00:00Z",
  "request_count": 0,
  "model_id": null
}

4. Check the status of a batch

You can check the status of a batch at any time, which will return updated batch information.

Check the status of a batch

Python
batch_stat = client.batches.get_batch(batch.id)

print(batch_stat.status)

The status of a given Batch object can be any of the following:

Status Description
VALIDATING The input file is being validated before the batch can begin
IN_PROGRESS Batch is in progress
COMPLETED Batch processing completed successfully
FAILED Batch processing failed
EXPIRED Batch exceeded deadline
CANCELLED Batch was cancelled
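
Because batches can run for hours, a simple polling loop is usually sufficient. This is a minimal sketch using the get_batch call shown above; the 60-second interval is just an example, in line with the monitoring guidance later on this page.

Python
import time

## Poll until the batch reaches a terminal state
terminal_states = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}

while True:
    batch_stat = client.batches.get_batch(batch.id)
    print(f"Batch {batch.id} status: {batch_stat.status}")
    if batch_stat.status in terminal_states:
        break
    time.sleep(60)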

5. Retrieve the results

Once the batch is complete, you can download the output by making a request to retrieve the output file using the output_file_id field from the Batch object.

Retrieving the batch results

Python
from bytecompute import bytecompute

client = bytecompute()

## Get the batch status to find output_file_id
batch = client.batches.get_batch('batch-xyz789')

if batch.status == 'COMPLETED':
    # Download the output file
    client.files.retrieve_content(id=batch.output_file_id, output="batch_output.jsonl")

The output .jsonl file will have one response line for every successful request line in the input file. Any failed requests will have their error information in a separate error file accessible via error_file_id.

Note that the output line order may not match the input line order. Use the custom_id field to map requests to results.
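
A minimal sketch for mapping results back to requests by custom_id; the exact shape of each output line beyond the custom_id field may vary, so adjust the field access to match your output file.

Python
import json

## Build a lookup from custom_id to the corresponding output line
results = {}
with open("batch_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        results[record["custom_id"]] = record

## Look up the result for a specific request from the input file
print(results.get("request-1"))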

6. Get a list of all batches

At any time, you can see all your batches.

Getting a list of all batches

Python
from bytecompute import bytecompute

client = bytecompute()

## List all batches
batches = client.batches.list_batches()

for batch in batches:
    print(batch)
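
If you only want batches that are still running, you can filter the listing by the status values from step 4 (assuming each listed object exposes the same status field as the Batch object above):

Python
## Show only batches that have not reached a terminal state
in_flight = [b for b in client.batches.list_batches()
             if b.status in ("VALIDATING", "IN_PROGRESS")]

for batch in in_flight:
    print(batch.id, batch.status)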

Model availability

The following models are supported for batch processing:

Model ID Size
deepseek-ai/DeepSeek-R1 685B
deepseek-ai/DeepSeek-V3 671B
meta-llama/Llama-3-70b-chat-hf 70B
meta-llama/Llama-3.3-70B-Instruct-Turbo 70B
meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 17B
meta-llama/Llama-4-Scout-17B-16E-Instruct 17B
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo 405B
meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo 70B
meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo 8B
mistralai/Mistral-7B-Instruct-v0.1 7B
mistralai/Mixtral-8x7B-Instruct-v0.1 8x7B
Qwen/Qwen2.5-72B-Instruct-Turbo 72B
Qwen/Qwen2.5-7B-Instruct-Turbo 7B
Qwen/Qwen3-235B-A22B-fp8-tput 235B
Qwen/QwQ-32B 32B

Rate limits

Batch API rate limits are separate from your existing per-model rate limits:

  • Max Token limits: A maximum of 10M tokens can be enqueued per model
  • Per-batch limits: A single batch may include up to 50,000 requests (see the splitting sketch after this list)
  • Batch file size: Maximum 100MB per batch input file
  • Separate pool: Batch API usage doesn't consume tokens from standard rate limits
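
If your workload exceeds the per-batch limits, one approach is to split the input file into several smaller files and submit each as its own batch. This is a rough sketch; the 10,000-request chunk size is illustrative and well under the 50,000-request cap.

Python
def split_batch_file(path, max_requests=10_000):
    """Split a large .jsonl input file into chunks that stay within the per-batch limits."""
    # A batch input file is at most 100MB, so reading it into memory is fine
    with open(path) as f:
        lines = f.readlines()
    chunk_paths = []
    for index, start in enumerate(range(0, len(lines), max_requests)):
        chunk_path = f"{path}.part{index}.jsonl"
        with open(chunk_path, "w") as out:
            out.writelines(lines[start:start + max_requests])
        chunk_paths.append(chunk_path)
    return chunk_paths

## Each chunk can then be uploaded and submitted as its own batch
for part in split_batch_file("batch_input.jsonl"):
    print(part)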

Error handling

When errors occur during batch processing, they are recorded in a separate error file accessible via the error_file_id field. Common error codes include:

Error Code Description Solution
400 Invalid request format Check JSONL syntax and required fields
401 Authentication failed Verify API key
404 Batch not found Check batch ID
429 Rate limit exceeded Reduce request frequency
500 Server error Retry with exponential backoff

Error File Format:

JSON
{"custom_id": "req-1", "error": {"message": "Invalid model specified", "code": "invalid_model"}}
{"custom_id": "req-5", "error": {"message": "Request timeout", "code": "timeout"}}
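
To inspect failures after a batch finishes, you can download the error file the same way as the output file and collect the failed custom_id values for a retry pass. This sketch assumes the error-line format shown above.

Python
import json

batch = client.batches.get_batch('batch-xyz789')

## Not every batch produces an error file; only download it when one is present
if batch.error_file_id:
    client.files.retrieve_content(id=batch.error_file_id, output="batch_errors.jsonl")

    failed_ids = []
    with open("batch_errors.jsonl") as f:
        for line in f:
            record = json.loads(line)
            print(record["custom_id"], record["error"]["message"])
            failed_ids.append(record["custom_id"])
    ## failed_ids can now be used to build a follow-up batch of retries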

Batch expiration

Batches that do not complete within the 24-hour window move to an EXPIRED state. Unfinished requests are cancelled, and completed requests are made available via the output file. You are only charged for tokens consumed by completed requests. Completion within 24 hours is best effort.

Best practices

Optimal Batch Size

  • Aim for 1,000-10,000 requests per batch for best performance
  • Maximum 50,000 requests per batch
  • Keep file size under 100MB

Error Handling

  • Always check the error_file_id for partial failures
  • Implement retry logic for failed requests
  • Use unique custom_id values for easy tracking

Model Selection

  • Choose models based on your quality/cost requirements
  • Smaller models (7B-17B) for simple tasks
  • Larger models (70B+) for complex reasoning

Request Formatting

  • Validate JSON before submission (see the sketch after this list)
  • Use consistent schema across requests
  • Include all required fields
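
A lightweight pre-submission check along these lines can catch malformed lines before you upload; it only enforces the schema from step 1.

Python
import json

def validate_batch_file(path):
    """Check that every line is valid JSON and contains the required fields."""
    seen_ids = set()
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            record = json.loads(line)  # raises on malformed JSON
            custom_id = record.get("custom_id")
            assert isinstance(custom_id, str) and 0 < len(custom_id) <= 64, f"line {n}: bad custom_id"
            assert custom_id not in seen_ids, f"line {n}: duplicate custom_id"
            assert isinstance(record.get("body"), dict), f"line {n}: missing body"
            seen_ids.add(custom_id)

validate_batch_file("batch_input.jsonl")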

Monitoring

  • Poll status endpoint every 30-60 seconds
  • Set up notifications for completion (if available)

FAQ

Q: How long do batches take to complete?
A: Processing time depends on batch size and model complexity. Most batches complete within 1-12 hours, but a batch can take up to 24 hours (or finish only partially within that window) depending on inference capacity.

Q: Can I cancel a running batch?
A: Currently, batches cannot be cancelled once processing begins.

Q: What happens if my batch exceeds the deadline?
A: The batch will be marked as EXPIRED and partial results may be available.

Q: Are results returned in the same order as requests?
A: No, results may be in any order. Use custom_id to match requests with responses.

Q: Can I use the same file for multiple batches?
A: Yes, uploaded files can be reused for multiple batch jobs.