Overview

Fine-tuning a large language model (LLM) refers to the process of further training a pre-trained model on a specific task or domain. This allows the model to build on its broad pre-training foundation and specialize for your specific task.

The fine-tuning process typically results in a few benefits:

  • Increased accuracy over off-the-shelf, pre-trained models on the given task
  • Reduced costs, since fine-tuned models typically require shorter prompts
  • The ability to fine-tune much smaller models that achieve performance comparable to larger models
  • Reduced hallucinations, by teaching the model how to respond to different user inputs

When to Fine-tune

Here are some common use-cases for fine-tuning:

  • Tailoring the style or tone of a model to a particular use-case: Adjusting the language model's output to match specific writing styles or tones required for different applications.
  • Improving consistency for a desired output structure (e.g. JSON): Teaching the model to produce consistent formatting or output structures, such as when generating JSON data, for better integration with downstream systems such as backend APIs (see the example after this list).
  • Handling multiple edge cases in a streamlined way: Refining the model to effectively address various exceptional scenarios or unusual inputs, enhancing its performance across a broader range of situations.
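
For the JSON use case above, each row of the training data pairs a prompt with a completion that is nothing but the desired JSON, with every completion following the same schema. The example below is purely illustrative; the field names and schema are hypothetical.

```python
# A hypothetical training example for teaching consistent JSON output.
# Every completion in the dataset should follow the exact same schema.
training_example = {
    "prompt": (
        "Extract the customer name, product, and sentiment from the review "
        "below and respond with JSON only.\n\n"
        "Review: The X200 headphones were a gift for Maria and she loves them!"
    ),
    "completion": (
        '{"customer_name": "Maria", '
        '"product": "X200 headphones", '
        '"sentiment": "positive"}'
    ),
}
```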

How to Fine-tune

There are two ways to fine-tune:

  • SDK: See our end-to-end fine-tuning example to get started with fine-tuning and prompting your adapter (a condensed sketch of the flow follows below this list)
  • UI: Navigate to our web UI where you can fine-tune and prompt your adapter, without writing a single line of code!
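
For reference, here is a condensed sketch of the SDK flow. It assumes the current Python SDK's dataset/repo/adapter methods; the file name, repo name, and base model are placeholders, and the end-to-end example linked above remains the authoritative reference for exact method names and arguments.

```python
from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Upload a dataset and create a repo to hold adapter versions
# (file path and names below are placeholders).
dataset = pb.datasets.from_file("train.csv", name="my_finetuning_dataset")
repo = pb.repos.create(name="my-adapter-repo", exists_ok=True)

# Kick off a fine-tuning job against a base model.
adapter = pb.adapters.create(
    config=FinetuningConfig(base_model="mistral-7b"),
    dataset=dataset,
    repo=repo,
    description="initial fine-tuning experiment",
)

# Prompt the resulting adapter via the base model's serverless deployment.
client = pb.deployments.client("mistral-7b")
response = client.generate(
    "What is machine learning?",
    adapter_id="my-adapter-repo/1",
)
print(response.generated_text)
```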

Common Questions

What factors affect training time and cost?

Fine-tuning on Predibase runs on NVIDIA A100 80GB GPUs. Training time primarily depends on:

  • Dataset size: The larger your dataset, the longer training will take. When experimenting with fine-tuning, we recommend starting with a smaller dataset (a few hundred examples). If the results look promising, you can then use your full dataset.
  • Model type: Larger models with more parameters typically take longer to train than smaller models.

The cost of a fine-tuning job on Predibase is industry-leading and is based on a per-token pricing model, which can be found here.

We use the tokenizer associated with a specific model during fine-tuning. You can estimate your own token usage (including prompt and completion) by using a tokenizer playground such as this Tokenizer Playground by Xenova on HuggingFace.
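
As a rough sketch, you can also count tokens programmatically with the Hugging Face tokenizer for your chosen base model. The model ID, file name, and column names below are assumptions for illustration:

```python
import pandas as pd
from transformers import AutoTokenizer

# Tokenizer for the base model you plan to fine-tune (example model ID).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Assumes a CSV with "prompt" and "completion" columns.
df = pd.read_csv("train.csv")

def count_tokens(row):
    return len(tokenizer(row["prompt"] + row["completion"])["input_ids"])

tokens_per_epoch = int(df.apply(count_tokens, axis=1).sum())
print(f"Estimated tokens per epoch: {tokens_per_epoch}")

# Multiply by the number of training epochs and the per-token price on the
# pricing page for a rough cost estimate.
```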

How many examples do I need to fine-tune?

The number of samples needed to produce a good fine-tuned model depends on a few factors including:

  • how much your task differs from the pretrained knowledge
  • the quality and diversity of your fine-tuning dataset
  • the size of the base model that you choose (smaller base models may require slightly larger training datasets than larger base models)

At a minimum, we recommend a few hundred examples, though empirically we see customers get optimal results starting in the 1k-2k+ example range.

What types of fine-tuning are supported?

Currently, Predibase supports instruction fine-tuning, where you provide the model with explicit input and output pairs, and completions-style fine-tuning, where the model is shown part of an input example and predicts the remaining tokens. In the near future, we plan to add Direct Preference Optimization (DPO), a more efficient variant of Reinforcement Learning from Human Feedback (RLHF), for post-training alignment.
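
As a rough illustration of the two dataset shapes (the column names below follow the common prompt/completion and free-text conventions; see the dataset documentation for the exact columns Predibase expects):

```python
# Instruction fine-tuning: explicit input/output pairs per row.
instruction_rows = [
    {
        "prompt": "Summarize: The meeting covered Q3 revenue and hiring plans.",
        "completion": "The meeting covered Q3 revenue and hiring plans.",
    },
]

# Completions-style fine-tuning: a single text field per row; the model is
# shown a prefix of the text and learns to predict the remaining tokens.
completion_rows = [
    {"text": "Quarterly report, Q3: revenue grew 12% quarter over quarter, driven by ..."},
]
```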

Do I need to include model-specific chat templates in the prompts in my datasets?

Experimentation shows that using model-specific chat templates significantly boosts performance. To learn more, click here.
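
For instruct/chat base models, the Hugging Face tokenizer can render the model-specific chat template for you; a minimal sketch (the model ID is just an example):

```python
from transformers import AutoTokenizer

# Each instruct/chat model ships its own template; the tokenizer knows it.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "Classify the sentiment: I loved this movie!"}
]

# Render the model-specific chat template as a plain string, leaving the
# assistant turn open so the completion can fill it in.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# e.g. "<s>[INST] Classify the sentiment: I loved this movie! [/INST]"
```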

Is there a token limit for fine-tuning a model?

We support fine-tuning on the entire context window supported by each base model. If any row of your dataset contains more tokens than the model's context window size, that row will be truncated to fit into the max sequence length. Fine-tuning with prompts longer than the max sequence length may impact performance.
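
A quick way to spot rows that would be truncated is to count tokens per row and compare against your base model's context window. The limit, model ID, file name, and column names below are assumptions for illustration:

```python
import pandas as pd
from transformers import AutoTokenizer

MAX_SEQ_LEN = 8192  # replace with your base model's context window

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
df = pd.read_csv("train.csv")

# Token count per row, assuming "prompt" and "completion" columns.
lengths = df.apply(
    lambda row: len(tokenizer(row["prompt"] + row["completion"])["input_ids"]),
    axis=1,
)
too_long = int((lengths > MAX_SEQ_LEN).sum())
print(f"{too_long} of {len(df)} rows exceed {MAX_SEQ_LEN} tokens and would be truncated.")
```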

For specific token limits for each model, please check out our full list of models.