Documentation

Batch Inference

Learn how to use the Batch API to send asynchronous groups of requests with 50% lower costs, higher rate limits, and flexible completion windows. The service is ideal for processing jobs that don't require immediate responses.

Overview

The Batch API enables you to process large volumes of requests asynchronously at 50% lower cost compared to real-time API calls. It's perfect for workloads that don't need immediate responses, such as:

  • Running evaluations and data analysis
  • Classifying large datasets
  • Offline summarization
  • Synthetic data generation
  • Content generation for marketing
  • Dataset processing and transformations

Compared to using standard endpoints directly, Batch API offers:

  • Better cost efficiency: 50% cost discount compared to synchronous APIs
  • Higher rate limits: Substantially more headroom with separate rate limit pools
  • Large-scale support: Process thousands of requests per batch
  • Flexible completion: Best-effort completion within 24 hours with progress tracking

Getting started

Note: Make sure your bytecompute version number is >1.5.13. Run pip install bytecompute --upgrade to upgrade if needed.

1. Prepare your batch file

Batches start with a .jsonl file where each line contains the details of an individual request to the API. The available endpoint is /v1/chat/completions (Chat Completions API). Each request must include a unique custom_id value, which you can use to reference results after completion. Here's an example of an input file with 2 requests:

JSON
{"custom_id": "request-1", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 200}}
{"custom_id": "request-2", "body": {"model": "deepseek-ai/DeepSeek-V3", "messages": [{"role": "user", "content": "Explain quantum computing"}], "max_tokens": 200}}

Each line in your batch file must follow this schema:

Field Type Required Description
custom_id string Yes Unique identifier for tracking (max 64 chars)
body object Yes The request body matching the endpoint's schema
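
If you are generating the input file programmatically, a minimal sketch along the following lines can help. The prompts list and filename are illustrative; only the custom_id and body fields from the schema above are assumed.

Python
import json

## Illustrative prompts; replace with your own data
prompts = ["Hello, world!", "Explain quantum computing"]

with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i + 1}",  # must be unique, max 64 chars
            "body": {
                "model": "deepseek-ai/DeepSeek-V3",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,
            },
        }
        f.write(json.dumps(request) + "\n")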

2. Upload your batch input file

You must first upload your input file so that you can reference it correctly when creating batches. Upload your .jsonl file using the Files API with purpose=batch-api.

Upload files for Batch API

Python
from bytecompute import bytecompute

client = bytecompute()

## Uploads batch job file
file_resp = client.files.upload(file="batch_input.jsonl", purpose="batch-api")

SH
bytecompute files upload batch_input.jsonl --purpose "batch-api"

This will return a file object with id and other details:

Python
FileResponse(
  id='file-fa37fdce-89cb-414b-923c-2add62250155', 
  object=<ObjectType.File: 'file'>, 
  ...
  filename='batch_input.jsonl', 
  bytes=1268723, 
  line_count=0, 
  processed=True, 
  FileType='jsonl')

3. Create the batch

Once you've successfully uploaded your input file, you can use the File object's ID to create a batch. For now, the completion window is fixed at 24h and cannot be changed. You can also provide custom metadata.

Create the Batch

Python
file_id = file_resp.id

batch = client.batches.create_batch(file_id, endpoint="/v1/chat/completions")

print(batch.id)

This request will return a Batch object with metadata about your batch:

JSON
{
  "id": "batch-xyz789",
  "status": "VALIDATING",
  "endpoint": "/v1/chat/completions",
  "input_file_id": "file-abc123",
  "created_at": "2024-01-15T10:00:00Z",
  "request_count": 0,
  "model_id": null
}

4. Check the status of a batch

You can check the status of a batch at any time, which will return updated batch information.

Check the status of a batch

Python
batch_stat = client.batches.get_batch(batch.id)

print(batch_stat.status)

The status of a given Batch object can be any of the following:

Status Description
VALIDATING The input file is being validated before the batch can begin
IN_PROGRESS Batch is in progress
COMPLETED Batch processing completed successfully
FAILED Batch processing failed
EXPIRED Batch exceeded deadline
CANCELLED Batch was cancelled
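
Because batches can run for hours, a simple polling loop is usually sufficient. This is a minimal sketch using the get_batch call shown above; the 60-second interval is just an example, in line with the monitoring guidance later on this page.

Python
import time

## Poll until the batch reaches a terminal state
terminal_states = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}

while True:
    batch_stat = client.batches.get_batch(batch.id)
    print(f"Batch {batch.id} status: {batch_stat.status}")
    if batch_stat.status in terminal_states:
        break
    time.sleep(60)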

5. Retrieve the results

Once the batch is complete, you can download the output by making a request to retrieve the output file using the output_file_id field from the Batch object.

Retrieving the batch results

Python
from bytecompute import bytecompute

client = bytecompute()

## Get the batch status to find output_file_id
batch = client.batches.get_batch('batch-xyz789')

if batch.status == 'COMPLETED':
    # Download the output file
    client.files.retrieve_content(id=batch.output_file_id, output="batch_output.jsonl")

The output .jsonl file will have one response line for every successful request line in the input file. Any failed requests will have their error information in a separate error file accessible via error_file_id.

Note that the output line order may not match the input line order. Use the custom_id field to map requests to results.
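
A minimal sketch for mapping results back to requests by custom_id; the exact shape of each output line beyond the custom_id field may vary, so adjust the field access to match your output file.

Python
import json

## Build a lookup from custom_id to the corresponding output line
results = {}
with open("batch_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        results[record["custom_id"]] = record

## Look up the result for a specific request from the input file
print(results.get("request-1"))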

6. Get a list of all batches

At any time, you can see all your batches.

Getting a list of all batches

Python
from bytecompute import bytecompute

client = bytecompute()

## List all batches
batches = client.batches.list_batches()

for batch in batches:
    print(batch)
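
If you only want batches that are still running, you can filter the listing by the status values from step 4 (assuming each listed object exposes the same status field as the Batch object above):

Python
## Show only batches that have not reached a terminal state
in_flight = [b for b in client.batches.list_batches()
             if b.status in ("VALIDATING", "IN_PROGRESS")]

for batch in in_flight:
    print(batch.id, batch.status)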

Model availability

The following models are supported for batch processing:

Model ID Size
deepseek-ai/DeepSeek-R1 685B
deepseek-ai/DeepSeek-V3 671B
meta-llama/Llama-3-70b-chat-hf 70B
meta-llama/Llama-3.3-70B-Instruct-Turbo 70B
meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 17B
meta-llama/Llama-4-Scout-17B-16E-Instruct 17B
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo 405B
meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo 70B
meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo 8B
mistralai/Mistral-7B-Instruct-v0.1 7B
mistralai/Mixtral-8x7B-Instruct-v0.1 8x7B
Qwen/Qwen2.5-72B-Instruct-Turbo 72B
Qwen/Qwen2.5-7B-Instruct-Turbo 7B
Qwen/Qwen3-235B-A22B-fp8-tput 235B
Qwen/QwQ-32B 32B

Rate limits

Batch API rate limits are separate from your existing per-model rate limits:

  • Max Token limits: A maximum of 10M tokens can be enqueued per model
  • Per-batch limits: A single batch may include up to 50,000 requests (see the splitting sketch after this list)
  • Batch file size: Maximum 100MB per batch input file
  • Separate pool: Batch API usage doesn't consume tokens from standard rate limits
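
If your workload exceeds the per-batch limits, one approach is to split the input file into several smaller files and submit each as its own batch. This is a rough sketch; the 10,000-request chunk size is illustrative and well under the 50,000-request cap.

Python
def split_batch_file(path, max_requests=10_000):
    """Split a large .jsonl input file into chunks that stay within the per-batch limits."""
    # A batch input file is at most 100MB, so reading it into memory is fine
    with open(path) as f:
        lines = f.readlines()
    chunk_paths = []
    for index, start in enumerate(range(0, len(lines), max_requests)):
        chunk_path = f"{path}.part{index}.jsonl"
        with open(chunk_path, "w") as out:
            out.writelines(lines[start:start + max_requests])
        chunk_paths.append(chunk_path)
    return chunk_paths

## Each chunk can then be uploaded and submitted as its own batch
for part in split_batch_file("batch_input.jsonl"):
    print(part)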

Error handling

When errors occur during batch processing, they are recorded in a separate error file accessible via the error_file_id field. Common error codes include:

Error Code Description Solution
400 Invalid request format Check JSONL syntax and required fields
401 Authentication failed Verify API key
404 Batch not found Check batch ID
429 Rate limit exceeded Reduce request frequency
500 Server error Retry with exponential backoff

Error File Format:

JSON
{"custom_id": "req-1", "error": {"message": "Invalid model specified", "code": "invalid_model"}}
{"custom_id": "req-5", "error": {"message": "Request timeout", "code": "timeout"}}
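
To inspect failures after a batch finishes, you can download the error file the same way as the output file and collect the failed custom_id values for a retry pass. This sketch assumes the error-line format shown above.

Python
import json

batch = client.batches.get_batch('batch-xyz789')

## Not every batch produces an error file; only download it when one is present
if batch.error_file_id:
    client.files.retrieve_content(id=batch.error_file_id, output="batch_errors.jsonl")

    failed_ids = []
    with open("batch_errors.jsonl") as f:
        for line in f:
            record = json.loads(line)
            print(record["custom_id"], record["error"]["message"])
            failed_ids.append(record["custom_id"])
    ## failed_ids can now be used to build a follow-up batch of retries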

Batch expiration

Batches that do not complete within the 24-hour window move to an EXPIRED state. Unfinished requests are cancelled, and completed requests are made available via the output file. You are only charged for tokens consumed by completed requests. Completion within 24 hours is best effort.

Best practices

Optimal Batch Size

  • Aim for 1,000-10,000 requests per batch for best performance
  • Maximum 50,000 requests per batch
  • Keep file size under 100MB

Error Handling

  • Always check the error_file_id for partial failures
  • Implement retry logic for failed requests
  • Use unique custom_id values for easy tracking

Model Selection

  • Choose models based on your quality/cost requirements
  • Smaller models (7B-17B) for simple tasks
  • Larger models (70B+) for complex reasoning

Request Formatting

  • Validate JSON before submission (see the sketch after this list)
  • Use consistent schema across requests
  • Include all required fields
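
A lightweight pre-submission check along these lines can catch malformed lines before you upload; it only enforces the schema from step 1.

Python
import json

def validate_batch_file(path):
    """Check that every line is valid JSON and contains the required fields."""
    seen_ids = set()
    with open(path) as f:
        for n, line in enumerate(f, start=1):
            record = json.loads(line)  # raises on malformed JSON
            custom_id = record.get("custom_id")
            assert isinstance(custom_id, str) and 0 < len(custom_id) <= 64, f"line {n}: bad custom_id"
            assert custom_id not in seen_ids, f"line {n}: duplicate custom_id"
            assert isinstance(record.get("body"), dict), f"line {n}: missing body"
            seen_ids.add(custom_id)

validate_batch_file("batch_input.jsonl")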

Monitoring

  • Poll status endpoint every 30-60 seconds
  • Set up notifications for completion (if available)

FAQ

Q: How long do batches take to complete?
A: Processing time depends on batch size and model complexity. Most batches complete within 1-12 hours, but a batch can take up to 24 hours (or finish only partially within that window) depending on inference capacity.

Q: Can I cancel a running batch?
A: Currently, batches cannot be cancelled once processing begins.

Q: What happens if my batch exceeds the deadline?
A: The batch will be marked as EXPIRED and partial results may be available.

Q: Are results returned in the same order as requests?
A: No, results may be in any order. Use custom_id to match requests with responses.

Q: Can I use the same file for multiple batches?
A: Yes, uploaded files can be reused for multiple batch jobs.