Deploying a Fine-tuned Model

Once your fine-tune job completes, you should see your new model in your models dashboard.

To use your model, you can either:

  1. Host it on Bytecompute AI as a dedicated endpoint (DE) for an hourly usage fee
  2. Run it immediately if the model supports Serverless LoRA Inference
  3. Download your model and run it locally

Hosting your model on Bytecompute AI

Select your model in the models dashboard and click CREATE DEDICATED ENDPOINT to create a dedicated endpoint for the fine-tuned model.

Once it's deployed, you can use the ID to query your new model using any of our APIs:

CLI
bytecompute chat.completions \
  --model "[email?protected]/Meta-Llama-3-8B-2024-07-11-22-57-17" \
  --message "user" "What are some fun things to do in New York?"
Python
import os
from bytecompute import bytecompute

client = bytecompute(api_key=os.environ.get("bytecompute_API_KEY"))

stream = client.chat.completions.create(
  model="[email?protected]/Meta-Llama-3-8B-2024-07-11-22-57-17",
  messages=[{"role": "user", "content": "What are some fun things to do in New York?"}],
  stream=True,
)

for chunk in stream:
  print(chunk.choices[0].delta.content or "", end="", flush=True)
TypeScript
import bytecompute from "bytecompute-ai";

const client = new bytecompute({
  apiKey: process.env["bytecompute_API_KEY"]
});

const stream = await client.chat.completions.create({
  model: "[email?protected]/Meta-Llama-3-8B-2024-07-11-22-57-17",
  messages: [{ role: "user", content: "What are some fun things to do in New York?" }],
  stream: true
});

for await (const chunk of stream) {
  // use process.stdout.write instead of console.log to avoid newlines
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Hosting your fine-tuned model is billed per minute that the endpoint is running. You can see the hourly pricing for fine-tuned model inference in the pricing table.

When you're not using the model, be sure to stop the endpoint from the models dashboard.
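
If you prefer to script this rather than use the dashboard, a minimal sketch against the HTTP API might look like the following; the /v1/endpoints path, the endpoint ID placeholder, and the "state" field are assumptions rather than documented parameters, so check the API reference for the exact request shape.

Shell
# Hypothetical sketch: stop a dedicated endpoint so it stops accruing hosting charges
# (the path and payload below are assumptions, not the confirmed API)
curl -X PATCH "https://api.bytecompute.xyz/v1/endpoints/your-endpoint-id" \
  -H "Authorization: Bearer $bytecompute_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"state": "STOPPED"}'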

Serverless LoRA Inference

If you fine-tuned the model using parameter-efficient LoRA fine-tuning, you can select the model in the models dashboard and click OPEN IN PLAYGROUND to quickly test the fine-tuned model.

You can also call the model directly, just like any other model on the Bytecompute AI platform, by providing its unique fine-tuned model output_name, which you can find for the specific model on the dashboard.

Shell
MODEL_NAME_FOR_INFERENCE="[email?protected]/Meta-Llama-3-8B-2024-07-11-22-57-17" # from the Model page or Fine-tuning page

curl -X POST https://api.bytecompute.xyz/v1/chat/completions \
  -H "Authorization: Bearer $bytecompute_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'$MODEL_NAME_FOR_INFERENCE'",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ],
    "max_tokens": 128
  }'
Python
import os
from bytecompute import bytecompute

client = bytecompute(api_key=os.environ.get("bytecompute_API_KEY"))

user_prompt = "debate the pros and cons of AI"

response = client.chat.completions.create(
    model="[email?protected]/Meta-Llama-3-8B-2024-07-11-22-57-17",
    messages=[{"role": "user", "content": user_prompt}],
    max_tokens=512,
    temperature=0.7,
)

print(response.choices[0].message.content)
TypeScript
import bytecompute from "bytecompute-ai";
const client = new bytecompute();

const stream = await client.chat.completions.create({
  model: "[email?protected]/Meta-Llama-3-8B-2024-07-11-22-57-17",
  messages: [{ role: "user", content: "Debate the pros and cons of AI" }],
  stream: true
});

for await (const chunk of stream) {
  // use process.stdout.write instead of console.log to avoid newlines
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

You can even upload LoRA adapters from Hugging Face or an S3 bucket.
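
As a rough illustration of what registering an external adapter might look like, here is a hypothetical sketch; the /v1/models path and the model_name, model_source, and model_type fields are assumptions rather than documented parameters, so consult the models API reference for the actual request format.

Shell
# Hypothetical sketch: register a LoRA adapter hosted on Hugging Face for serverless inference
# (the endpoint path and field names below are assumptions, not the confirmed API)
curl -X POST https://api.bytecompute.xyz/v1/models \
  -H "Authorization: Bearer $bytecompute_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "my-org/my-lora-adapter",
    "model_source": "huggingface-username/adapter-repo",
    "model_type": "adapter"
  }'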

Running Your Model Locally

To run your model locally, first download it by calling download with your job ID:

CLI
bytecompute fine-tuning download "ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04"
Python
import os
from bytecompute import bytecompute

client = bytecompute(api_key=os.environ.get("bytecompute_API_KEY"))

client.fine_tuning.download(
  id="ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04",
  output="my-model/model.tar.zst"
)
TypeScript
import bytecompute from "bytecompute-ai";

const client = new bytecompute({
  apiKey: process.env["bytecompute_API_KEY"]
});

await client.fineTune.download({
  ft_id: "ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04",
  output: "my-model/model.tar.zst"
});

Your model will be downloaded to the location specified in output as a tar.zst file, an archive format compressed with the Zstandard algorithm. You'll need to install Zstandard (zstd) to decompress your model.

On Macs, you can use Homebrew:

Shell
brew install zstd
cd my-model
zstd -d model.tar.zst
tar -xvf model.tar
cd ..
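
On Linux, zstd is available from most distributions' package managers; for example, on Debian or Ubuntu the equivalent steps would be:

Shell
# Debian/Ubuntu: install zstd, then decompress and unpack the archive
sudo apt-get install zstd
cd my-model
zstd -d model.tar.zst
tar -xvf model.tar
cd ..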

Once your archive is decompressed, you should see the following set of files:

tokenizer_config.json
special_tokens_map.json
pytorch_model.bin
generation_config.json
tokenizer.json
config.json

These can be used with various libraries and languages to run your model locally. Transformers is a popular Python library for working with pretrained models, and using it with your new model looks like this:

Python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("./my-model")

model = AutoModelForCausalLM.from_pretrained(
  "./my-model",
  trust_remote_code=True,
).to(device)

input_context = "Space Robots are"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
# do_sample=True enables sampling so that the temperature setting takes effect
output = model.generate(input_ids.to(device), max_length=128, do_sample=True, temperature=0.7).cpu()
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(output_text)
Output
Space Robots are a great way to get your kids interested in science. After all, they are the future!

If you see output like this, your new model is working!

You now have a custom fine-tuned model that you can run completely locally, either on your own machine or on networked hardware of your choice.