Deploying a Fine-tuned Model
Once your fine-tune job completes, you should see your new model in your models dashboard.
To use your model, you can do one of the following:
- Host it on Bytecompute AI as a dedicated endpoint (DE) for an hourly usage fee
- Run it immediately if the model supports Serverless LoRA Inference
- Download your model and run it locally
Hosting your model on Bytecompute AI
If you select your model in the models dashboard, you can click CREATE DEDICATED ENDPOINT to deploy the fine-tuned model as a dedicated endpoint.
Once it's deployed, you can use its model ID to query your new model with any of our APIs:
shell CLI
bytecompute chat.completions \
  --model "your-account/Meta-Llama-3-8B-2024-07-11-22-57-17" \
  --message "user" "What are some fun things to do in New York?"
python Python
import os
from bytecompute import bytecompute

client = bytecompute(api_key=os.environ.get("bytecompute_API_KEY"))

stream = client.chat.completions.create(
    model="your-account/Meta-Llama-3-8B-2024-07-11-22-57-17",
    messages=[{"role": "user", "content": "What are some fun things to do in New York?"}],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
typescript TypeScript
import bytecompute from "bytecompute-ai";

const client = new bytecompute({
  apiKey: process.env["bytecompute_API_KEY"]
});

const stream = await client.chat.completions.create({
  model: "your-account/Meta-Llama-3-8B-2024-07-11-22-57-17",
  messages: [{ role: "user", content: "What are some fun things to do in New York?" }],
  stream: true
});

for await (const chunk of stream) {
  // use process.stdout.write instead of console.log to avoid newlines
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
Hosting your fine-tuned model is billed for the time the endpoint is deployed. You can see the pricing for fine-tuned model inference in the pricing table.
When you're not using the model, be sure to stop the endpoint from the models dashboard.
Serverless LoRA Inference
If you fine-tuned the model using parameter-efficient LoRA fine-tuning, you can select the model in the models dashboard and click OPEN IN PLAYGROUND to quickly test the fine-tuned model.
You can also call the model directly, just like any other model on the Bytecompute AI platform, by providing its unique output_name, which you can find for the specific model on the dashboard.
shell Shell
MODEL_NAME_FOR_INFERENCE="your-account/Meta-Llama-3-8B-2024-07-11-22-57-17" # from the Model page or Fine-tuning page

curl -X POST https://api.bytecompute.xyz/v1/chat/completions \
  -H "Authorization: Bearer $bytecompute_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'$MODEL_NAME_FOR_INFERENCE'",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ],
    "max_tokens": 128
  }'
python Python
from bytecompute import bytecompute

client = bytecompute()

user_prompt = "Debate the pros and cons of AI"

response = client.chat.completions.create(
    model="your-account/Meta-Llama-3-8B-2024-07-11-22-57-17",
    messages=[{"role": "user", "content": user_prompt}],
    max_tokens=512,
    temperature=0.7,
)

print(response.choices[0].message.content)
typescript TypeScript
import bytecompute from "bytecompute-ai";

const client = new bytecompute();

const stream = await client.chat.completions.create({
  model: "your-account/Meta-Llama-3-8B-2024-07-11-22-57-17",
  messages: [{ role: "user", content: "Debate the pros and cons of AI" }],
  stream: true
});

for await (const chunk of stream) {
  // use process.stdout.write instead of console.log to avoid newlines
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
You can even upload LoRA adapters from Hugging Face or an S3 bucket.
Running Your Model Locally
To run your model locally, first download it by calling download with your job ID:
shell CLI
bytecompute fine-tuning download "ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04"
python Python
import os
from bytecompute import bytecompute

client = bytecompute(api_key=os.environ.get("bytecompute_API_KEY"))

client.fine_tuning.download(
    id="ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04",
    output="my-model/model.tar.zst"
)
typescript TypeScript
import bytecompute from "bytecompute-ai";

const client = new bytecompute({
  apiKey: process.env["bytecompute_API_KEY"]
});

await client.fineTune.download({
  ft_id: "ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04",
  output: "my-model/model.tar.zst"
});
Your model will be downloaded to the location specified in output as a tar.zst file, a tar archive compressed with the Zstandard algorithm. You'll need Zstandard (zstd) installed to decompress your model.
On macOS, you can install it with Homebrew:
shell Shell
brew install zstd
cd my-model
zstd -d model.tar.zst
tar -xvf model.tar
cd ..
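If you'd rather not rely on command-line tools (or you're on a platform without zstd), the archive can also be decompressed from Python. This is a minimal sketch, assuming the zstandard package (pip install zstandard) is installed and the archive was downloaded to my-model/model.tar.zst as above:
python Python
import tarfile

import zstandard  # pip install zstandard

# Stream-decompress model.tar.zst and extract the enclosed tar archive into ./my-model
dctx = zstandard.ZstdDecompressor()
with open("my-model/model.tar.zst", "rb") as compressed:
    with dctx.stream_reader(compressed) as reader:
        with tarfile.open(fileobj=reader, mode="r|") as tar:
            tar.extractall("my-model")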
Once your archive is decompressed, you should see the following set of files:
tokenizer_config.json
special_tokens_map.json
pytorch_model.bin
generation_config.json
tokenizer.json
config.json
These can be used with various libraries and languages to run your model locally. Transformers is a popular Python library for working with pretrained models, and using it with your new model looks like this:
python Python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("./my-model")
model = AutoModelForCausalLM.from_pretrained(
    "./my-model",
    trust_remote_code=True,
).to(device)

input_context = "Space Robots are"
input_ids = tokenizer.encode(input_context, return_tensors="pt")
# do_sample=True so that the temperature setting actually affects generation
output = model.generate(input_ids.to(device), max_length=128, do_sample=True, temperature=0.7).cpu()
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
Space Robots are a great way to get your kids interested in science. After all, they are the future!
If you see output like this, your new model is working!
You now have a custom fine-tuned model that you can run completely locally, either on your own machine or on networked hardware of your choice.
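If you prefer a higher-level interface than calling generate directly, the same extracted files also work with the Transformers pipeline API. This is a minimal sketch, again assuming the model was extracted to ./my-model:
python Python
from transformers import pipeline

# Build a text-generation pipeline from the locally extracted files.
# This runs on CPU by default; pass device=0 to use the first GPU.
generator = pipeline("text-generation", model="./my-model", trust_remote_code=True)

result = generator(
    "Space Robots are",
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])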
