Chat

You can use bytecompute's APIs to send individual queries or have long-running conversations with chat models. You can also configure a system prompt to customize how a model should respond.

Queries run against a model of your choice. For most use cases, we recommend using Meta Llama 3.

Running a single query

Use chat.completions.create to send a single query to a chat model:

Python

from bytecompute import bytecompute

client = bytecompute()

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct-Reference",
    messages=[{"role": "user", "content": "What are some fun things to do in New York?"}],
)

print(response.choices[0].message.content)

TypeScript

import Bytecompute from "bytecompute-ai";

const bytecompute = new Bytecompute();

const response = await bytecompute.chat.completions.create({
  model: "meta-llama/Meta-Llama-3-8B-Instruct-Reference",
  messages: [{ role: "user", content: "What are some fun things to do in New York?" }],
});

console.log(response.choices[0].message.content);

SH

curl -X POST "https://api.bytecompute.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $bytecompute_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
     	"model": "meta-llama/Meta-Llama-3-8B-Instruct-Reference",
     	"messages": [
     		{"role": "user", "content": "What are some fun things to do in New York?"}
     	]
     }'

The create method takes in a model name and a messages array. Each message is an object that has the content of the query, as well as a role for the message's author.

In the example above, you can see that we're using "user" for the role. The "user" role tells the model that this message comes from the end user of your system, for example, a customer using your chatbot app.

The other two roles are "assistant" and "system", which we'll talk about next.
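Because each message is a plain object with a role and content, it can be useful to check a messages array on the client side before sending it. The sketch below is illustrative only: `validate_messages` and `ALLOWED_ROLES` are hypothetical names, not part of the bytecompute library, and the allowed roles are simply the three named on this page.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # the three roles named on this page


def validate_messages(messages):
    """Return the messages unchanged, or raise ValueError on a malformed entry.

    Hypothetical client-side check for illustration only.
    """
    if not messages:
        raise ValueError("messages must contain at least one message")
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"message {i} has an unknown role: {role!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"message {i} needs string content")
    return messages


checked = validate_messages(
    [{"role": "user", "content": "What are some fun things to do in New York?"}]
)
```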

Having a long-running conversation

Every query to a chat model is self-contained. This means that new queries won't automatically have access to any queries that may have come before them. This is exactly why the "assistant" role exists.

The "assistant" role is used to provide historical context for how a model has responded to prior queries. This makes it perfect for building apps that have long-running conversations, like chatbots.

To provide a chat history for a new query, pass the previous messages to the messages array, denoting the user-provided queries with the "user" role, and the model's responses with the "assistant" role:

Python

from bytecompute import bytecompute

client = bytecompute()

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct-Reference",
    messages=[
      {"role": "user", "content": "What are some fun things to do in New York?"},
      {"role": "assistant", "content": "You could go to the Empire State Building!"},
      {"role": "user", "content": "That sounds fun! Where is it?"},
    ],
)

print(response.choices[0].message.content)

TypeScript

import Bytecompute from "bytecompute-ai";

const bytecompute = new Bytecompute();

const response = await bytecompute.chat.completions.create({
  model: "meta-llama/Meta-Llama-3-8B-Instruct-Reference",
  messages: [
    { role: "user", content: "What are some fun things to do in New York?" },
    { role: "assistant", content: "You could go to the Empire State Building!" },
    { role: "user", content: "That sounds fun! Where is it?" },
  ],
});

console.log(response.choices[0].message.content);

SH

curl -X POST "https://api.bytecompute.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $bytecompute_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
     	"model": "meta-llama/Meta-Llama-3-8B-Instruct-Reference",
     	"messages": [
        {"role": "user", "content": "What are some fun things to do in New York?"},
        {"role": "assistant", "content": "You could go to the Empire State Building!"},
        {"role": "user", "content": "That sounds fun! Where is it?" }
     	]
     }'

How your app stores historical messages is up to you.
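As one possible approach, the history can live in a small in-memory object that appends each user query and each model reply in order. This is a minimal sketch, not a recommended storage design: `Conversation`, `ask`, and `fake_send` are hypothetical names, and `fake_send` stands in for a real call to `client.chat.completions.create` so the example runs offline.

```python
class Conversation:
    """Keeps an in-memory chat history in the shape the messages array expects."""

    def __init__(self):
        self.messages = []

    def add_user(self, content):
        self.messages.append({"role": "user", "content": content})

    def add_assistant(self, content):
        self.messages.append({"role": "assistant", "content": content})


def ask(conversation, question, send):
    """Record the question, call send(messages), then record the reply.

    `send` is any callable that takes the messages list and returns the
    model's reply as a string.
    """
    conversation.add_user(question)
    reply = send(list(conversation.messages))
    conversation.add_assistant(reply)
    return reply


# Stand-in for a real API call (an assumption for illustration).
def fake_send(messages):
    return f"(reply to: {messages[-1]['content']})"


chat = Conversation()
ask(chat, "What are some fun things to do in New York?", fake_send)
ask(chat, "That sounds fun! Where is it?", fake_send)
print([m["role"] for m in chat.messages])
# prints ['user', 'assistant', 'user', 'assistant']
```

In a real app, `send` would wrap the API call and return `response.choices[0].message.content`.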

Customizing how the model responds

While you can query a model just by providing a user message, typically you'll want to give your model some context for how you'd like it to respond. For example, if you're building a chatbot to help your customers with travel plans, you might want to tell your model that it should act like a helpful travel guide.

To do this, provide an initial message that uses the "system" role:

Python

from bytecompute import bytecompute

client = bytecompute()

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    messages=[
      {"role": "system", "content": "You are a helpful travel guide."},
      {"role": "user", "content": "What are some fun things to do in New York?"},
    ],
)

print(response.choices[0].message.content)

TypeScript

import Bytecompute from "bytecompute-ai";

const bytecompute = new Bytecompute();

const response = await bytecompute.chat.completions.create({
  model: "meta-llama/Llama-3-8b-chat-hf",
  messages: [
    { role: "system", content: "You are a helpful travel guide." },
    { role: "user", content: "What are some fun things to do in New York?" },
  ],
});

console.log(response.choices[0].message.content);

SH

curl -X POST "https://api.bytecompute.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $bytecompute_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
     	"model": "meta-llama/Llama-3-8b-chat-hf",
     	"messages": [
     		{"role": "system", "content": "You are a helpful travel guide."},
     		{"role": "user", "content": "What are some fun things to do in New York?"}
     	]
     }'
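When a system prompt is combined with a stored chat history, the system message goes first and the earlier turns follow it. The helper below is a sketch under that assumption; `with_system_prompt` is a hypothetical name, and dropping any system messages already in the history is a design choice made here to avoid duplicates, not documented API behavior.

```python
def with_system_prompt(system_prompt, history):
    """Return a new messages list with a single system message first.

    System messages already present in `history` are dropped so the prompt
    is not duplicated when the history is reused across requests.
    """
    rest = [m for m in history if m["role"] != "system"]
    return [{"role": "system", "content": system_prompt}] + rest


history = [
    {"role": "user", "content": "What are some fun things to do in New York?"},
    {"role": "assistant", "content": "You could go to the Empire State Building!"},
    {"role": "user", "content": "That sounds fun! Where is it?"},
]
messages = with_system_prompt("You are a helpful travel guide.", history)
```

The resulting `messages` list can be passed directly as the `messages` argument, and `history` itself is left unmodified.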

Streaming responses

Since models can take some time to respond to a query, bytecompute's APIs support streaming back responses in chunks. This lets you display results from each chunk while the model is still running, instead of having to wait for the entire response to finish.

To return a stream, set the stream option to true.

Python

from bytecompute import bytecompute

client = bytecompute()

stream = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    messages=[{"role": "user", "content": "What are some fun things to do in New York?"}],
    stream=True,
)

# Print each chunk's text as soon as it arrives.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

TypeScript

import Bytecompute from 'bytecompute-ai';

const bytecompute = new Bytecompute();

const stream = await bytecompute.chat.completions.create({
  model: 'meta-llama/Llama-3-8b-chat-hf',
  messages: [
    { role: 'user', content: 'What are some fun things to do in New York?' },
  ],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

SH

curl -X POST "https://api.bytecompute.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $bytecompute_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
     	"model": "meta-llama/Llama-3-8b-chat-hf",
     	"messages": [
     		{"role": "user", "content": "What are some fun things to do in New York?"}
     	],
      "stream": true
     }'
     
## Response will be a stream of Server-Sent Events with JSON-encoded payloads. For example:
## 
## data: {"choices":[{"index":0,"delta":{"content":" A"}}],"id":"85ffbb8a6d2c4340-EWR","token":{"id":330,"text":" A","logprob":1,"special":false},"finish_reason":null,"generated_text":null,"stats":null,"usage":null,"created":1709700707,"object":"chat.completion.chunk"}
## data: {"choices":[{"index":0,"delta":{"content":":"}}],"id":"85ffbb8a6d2c4340-EWR","token":{"id":28747,"text":":","logprob":0,"special":false},"finish_reason":null,"generated_text":null,"stats":null,"usage":null,"created":1709700707,"object":"chat.completion.chunk"}
## data: {"choices":[{"index":0,"delta":{"content":" Sure"}}],"id":"85ffbb8a6d2c4340-EWR","token":{"id":12875,"text":" Sure","logprob":-0.00724411,"special":false},"finish_reason":null,"generated_text":null,"stats":null,"usage":null,"created":1709700707,"object":"chat.completion.chunk"}
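If you consume the raw event stream yourself rather than through a client library, each `data:` line can be decoded with a JSON parser and the text fragments joined together. The following is a minimal sketch: `iter_sse_content` is a hypothetical helper, the sample lines are shortened versions of the events shown above, and the `[DONE]` end-of-stream sentinel is an assumption that this page does not confirm.

```python
import json


def iter_sse_content(lines):
    """Yield the text fragment from each `data:` line of a chat stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed sentinel; not shown on this page
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if choices:
            yield choices[0].get("delta", {}).get("content") or ""


# Shortened copies of the sample events above.
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":" A"}}],"object":"chat.completion.chunk"}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":":"}}],"object":"chat.completion.chunk"}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" Sure"}}],"object":"chat.completion.chunk"}',
]
text = "".join(iter_sse_content(sample))
print(repr(text))
# prints ' A: Sure'
```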

A note on async support in Python

The standard Python client makes blocking (synchronous) calls, so multiple queries execute one after another, even if they are independent.

If you have multiple independent calls that you want to run concurrently, you can use the Asyncbytecompute client from our Python library:

Python

import asyncio
import os

from bytecompute import Asyncbytecompute

messages = [
    "What are the top things to do in San Francisco?",
    "What country is Paris in?",
]

async def async_chat_completion(messages):
    # Create the client once and reuse it for every request.
    async_client = Asyncbytecompute(api_key=os.environ.get("bytecompute_API_KEY"))
    # Build one coroutine per message; nothing is sent until they are awaited.
    tasks = [
        async_client.chat.completions.create(
            model="mistralai/Mixtral-8x7B-Instruct-v0.1",
            messages=[{"role": "user", "content": message}],
        )
        for message in messages
    ]
    # gather runs the requests concurrently and returns responses in input order.
    responses = await asyncio.gather(*tasks)

    for response in responses:
        print(response.choices[0].message.content)

asyncio.run(async_chat_completion(messages))
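With a long list of queries, you may prefer to cap how many requests are in flight at once. The sketch below shows one way to do that with `asyncio.Semaphore`. It is illustrative only: `gather_limited` and `fake_completion` are hypothetical names, `fake_completion` stands in for `async_client.chat.completions.create` so the example runs offline, and this page does not state whether the API enforces any rate limits.

```python
import asyncio


async def gather_limited(coroutine_factories, limit):
    """Run the coroutines concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:
            return await factory()

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(run(f) for f in coroutine_factories))


# Stand-in for a real async API call (an assumption for illustration).
async def fake_completion(prompt):
    await asyncio.sleep(0.01)
    return f"(reply to: {prompt})"


prompts = [
    "What are the top things to do in San Francisco?",
    "What country is Paris in?",
]
# Each factory creates its coroutine only when a semaphore slot is free.
factories = [lambda p=p: fake_completion(p) for p in prompts]
replies = asyncio.run(gather_limited(factories, limit=1))
print(replies)
```

Passing factories rather than ready-made coroutines means no request is started until it has acquired a slot.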
