Inference FAQs
What models are available for inference on Bytecompute?
Bytecompute hosts a wide range of open-source models; you can view the latest list on the inference models page.
What is the maximum context window supported by Bytecompute models?
The maximum context window varies significantly by model. Refer to the specific model's documentation or the inference models page for the exact context length supported by each model.
How do I send a request to an inference endpoint?
You can use the OpenAI-compatible API. Example using curl:
```bash
curl https://api.bytecompute.xyz/v1/chat/completions \
  -H "Authorization: Bearer $BYTECOMPUTE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
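If you prefer a client library, any OpenAI-compatible SDK should work by pointing it at the Bytecompute base URL. Below is a minimal sketch using the official `openai` Python package; the base URL and model name are taken from the curl example above.

```python
import os

from openai import OpenAI

# Point the OpenAI SDK at Bytecompute's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.bytecompute.xyz/v1",
    api_key=os.environ["BYTECOMPUTE_API_KEY"],
)

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```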
What kind of latency can I expect for inference requests?
Latency depends on the model and prompt length. Smaller models like Mistral may respond in less than 1 second, while larger MoE models like Mixtral may take several seconds. Prompt caching and streaming can help reduce perceived latency.
Is Bytecompute suitable for high-throughput workloads?
Yes. Bytecompute supports production-scale inference. For high-throughput applications (e.g., over 100 RPS), contact the Bytecompute team for dedicated support and infrastructure.
Does Bytecompute support streaming responses?
Yes. You can receive streamed tokens by setting "stream": true in your request. This allows you to begin processing output as soon as it is generated.
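For example, using the same OpenAI-compatible Python client as above (a sketch; the model name is illustrative):

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.bytecompute.xyz/v1",
    api_key=os.environ["BYTECOMPUTE_API_KEY"],
)

# With stream=True the API returns chunks as tokens are generated.
stream = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```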
Is my data stored or logged?
Bytecompute does not store your input or output by default. Temporary caching may be used for performance unless otherwise configured.
Can I run inference in my own VPC or on-premise?
Yes. Bytecompute supports private networking and VPC-based deployments for enterprise customers requiring data residency or regulatory compliance. Contact us for more information.
Can I use quantized models for faster inference?
Yes. Bytecompute hosts some models with quantized weights (e.g., FP8 or INT4) for faster and more memory-efficient inference. Support varies by model.
Can I cache prompts or use speculative decoding?
Yes. Bytecompute supports optimizations like prompt caching and speculative decoding for models that allow it, reducing latency and improving throughput.
Do you support function calling or tool use?
Function calling is natively supported for some models; for models without native support, structured prompting can simulate function-like behavior.
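For models with native support, tool definitions follow the OpenAI chat-completions format. The sketch below assumes a hypothetical `get_weather` tool and an illustrative model name; check each model's documentation for whether it returns `tool_calls`.

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.bytecompute.xyz/v1",
    api_key=os.environ["BYTECOMPUTE_API_KEY"],
)

# Hypothetical tool definition in the OpenAI-compatible "tools" format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative; pick a model with native tool support
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# If the model decides to call the tool, the arguments arrive as JSON text.
print(response.choices[0].message.tool_calls)
```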
Do you support structured outputs or JSON mode?
Yes. You can use JSON mode to get structured outputs from models such as DeepSeek V3 and Llama 3.3.
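A minimal sketch using the OpenAI-compatible `response_format` parameter; the model identifier is illustrative, so substitute one that supports JSON mode:

```python
import json
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.bytecompute.xyz/v1",
    api_key=os.environ["BYTECOMPUTE_API_KEY"],
)

# JSON mode constrains the model to emit a valid JSON object.
response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative; use a JSON-mode-capable model
    messages=[{"role": "user", "content": "List three GPU vendors as JSON with a 'vendors' array."}],
    response_format={"type": "json_object"},
)
data = json.loads(response.choices[0].message.content)
print(data)
```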
How is inference usage billed?
Inference is billed per input and output token, with rates varying by model. Refer to the pricing page for current pricing details.
What happens if I exceed my rate limit or quota?
You will receive a 429 Too Many Requests error. You can request higher limits via the Bytecompute dashboard or by contacting support.
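A common client-side pattern is exponential backoff on 429 responses. Here is a sketch using the `openai` Python SDK, which raises `RateLimitError` for 429s:

```python
import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.bytecompute.xyz/v1",
    api_key=os.environ["BYTECOMPUTE_API_KEY"],
)

def chat_with_backoff(max_retries=5, **kwargs):
    """Retry a chat completion with exponential backoff on 429 errors."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("Still rate limited after retries")

response = chat_with_backoff(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```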
Can I run batched or parallel inference requests?
Yes. Bytecompute supports batching and high-concurrency usage. You can send parallel requests from your client and take advantage of backend batching. See Batch Inference for more details.
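For client-side parallelism, a thread pool over the blocking SDK call is usually enough. The sketch below fans out a few prompts concurrently; the concurrency level and model name are illustrative.

```python
import os
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(
    base_url="https://api.bytecompute.xyz/v1",
    api_key=os.environ["BYTECOMPUTE_API_KEY"],
)

prompts = ["Summarize GPUs in one line.", "Summarize TPUs in one line."]

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Send requests concurrently; keep max_workers within your rate limit.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(complete, prompts))

for prompt, answer in zip(prompts, results):
    print(prompt, "->", answer)
```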
Can I use Bytecompute inference with LangChain or LlamaIndex?
Yes. Because Bytecompute exposes an OpenAI-compatible API, it works with LangChain and LlamaIndex through their OpenAI integrations. Set your Bytecompute API key, base URL, and model name in your environment or code.
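For example, LangChain's `ChatOpenAI` wrapper can be pointed at the Bytecompute base URL (a sketch; the package is `langchain-openai`, and the model name is illustrative):

```python
import os

from langchain_openai import ChatOpenAI

# Reuse the OpenAI-compatible endpoint by overriding the base URL.
llm = ChatOpenAI(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    api_key=os.environ["BYTECOMPUTE_API_KEY"],
    base_url="https://api.bytecompute.xyz/v1",
)

print(llm.invoke("Hello!").content)
```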
How does Bytecompute ensure the uptime and reliability of its inference endpoints?
Bytecompute aims for high reliability, offering a 99.9% uptime SLA for dedicated endpoints.
