Fine-Tuning
Fine-tuning a large language model refers to further training the pre-trained model on a specific task or domain using a smaller dataset. The initial pre-training phase trains the model on a massive corpus of text data to learn general language patterns and representations. Fine-tuning, on the other hand, customizes the model to a specific task or domain by exposing it to task-specific data. By fine-tuning a large language model on a specific task, you leverage the model's pre-trained knowledge while tailoring it to the nuances and requirements of your target task. This typically allows the model to perform better and achieve higher accuracy on that task than the pre-trained model would on its own.
The fine-tuning process typically involves the following steps:
- Task Definition: Specify the task or downstream application you want the model to perform, such as text classification, language generation, or question answering. This is typically done by curating your dataset so that the output feature is representative of the task type.
- Data Preparation: Gather or create a labeled dataset specific to your task. This dataset should be representative of the target domain or task you want the model to excel at.
- Select LLM: Pick the base LLM that you would like to fine-tune on your dataset. Hugging Face lists a large variety of large language models you can choose from.
- Train your model: Fine-tune the model on your task-specific dataset by exposing it to the labeled examples. During this process, the model's parameters are updated using gradient-based optimization techniques (e.g., backpropagation) to minimize the task-specific loss function. There are two ways to fine-tune: adjusting all of the model's weights, or adding new task-specific layers that are learned during training (see the sketch after this list).
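To make the training step concrete, here is a minimal sketch of a full-weight fine-tuning loop using Hugging Face transformers and PyTorch. The base model (`gpt2`) and the labeled examples are illustrative placeholders, not anything Predibase-specific:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base LLM; substitute the model you selected
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical task-specific examples (a sentiment task phrased as text).
examples = [
    "Review: I love this movie! Sentiment: positive",
    "Review: The customer service was terrible. Sentiment: negative",
]
batch = tokenizer(examples, return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # exclude padding from the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for step in range(3):  # a few illustrative gradient steps
    loss = model(**batch, labels=labels).loss  # cross-entropy next-token loss
    loss.backward()                            # backpropagation
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss={loss.item():.3f}")
```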
Zero/Few Shot vs Fine-tuning
When you're deciding between zero/few-shot inference and fine-tuning your model, you can use the flow chart below to determine whether fine-tuning makes sense for you.
Here's when to use each approach:
- Zero-shot evaluation: This approach is useful when you have a pre-trained model that has been trained on a wide range of tasks and you want to make predictions on a new task for which you don't have any labeled data. Zero-shot evaluation lets you leverage the general knowledge and understanding of the pre-trained model without any specific fine-tuning: you provide the model with a description or a prompt of the task, and the model generates responses based on its pre-existing knowledge.
- Few-shot evaluation: Few-shot learning is used when you have a limited amount of labeled data for a new task. You provide the model with a small number of examples (the "few shots") as part of the prompt, giving the model additional context it can use to generate a response.
- Fine-tuning: Fine-tuning is used when you have a large amount of labeled data for a specific task or domain. In this case, you can take a pre-trained model and further train it on your task-specific data to optimize its performance. Fine-tuning allows the model to learn task-specific features and adapt to the specific data distribution of the task at hand. It is typically more resource-intensive and time-consuming compared to zero-shot or few-shot evaluation, but it can lead to better performance when sufficient labeled data is available.
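For concreteness, the sketch below contrasts a zero-shot prompt with a few-shot prompt for a hypothetical sentiment task; the wording of the prompts is illustrative:

```python
# Zero-shot: the task description alone, relying on pre-trained knowledge.
zero_shot_prompt = (
    "Classify the sentiment of the review as positive or negative.\n"
    "Review: The food was cold and the staff were rude.\n"
    "Sentiment:"
)

# Few-shot: a handful of labeled examples prepended as context.
few_shot_prompt = (
    "Classify the sentiment of the review as positive or negative.\n"
    "Review: I love this movie!\nSentiment: positive\n"
    "Review: The customer service was terrible.\nSentiment: negative\n"
    "Review: The food was cold and the staff were rude.\n"
    "Sentiment:"
)
```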
Fine-tuning Readiness: What should your dataset look like?
For fine-tuning a large language model, your data should ideally be representative of the specific task or domain you want to improve the model's performance on. Here are some considerations for the data:
- Task-specific data: The data should include examples that are relevant to the task you want the model to excel in. For example, if you want to fine-tune the model for sentiment analysis, you would need labeled data consisting of text samples along with their corresponding sentiment labels (positive, negative, neutral).
- Sufficient data volume: Fine-tuning generally requires a substantial amount of labeled data to achieve good performance. The more diverse and comprehensive your data is, the better the model can learn and generalize from it. Ideally, you should aim to have at least a few hundred labeled examples, although the exact amount may depend on the complexity of the task and the specific model architecture.
- Data format: In Predibase, we support text input features and either text or category (which can also be used for binary classification tasks) as output feature types.
Text Input, Text Output Example Data
| text_input | text_output |
|---|---|
| "I want a recipe for spaghetti Bolognese." | { "pasta": "200g", "ground beef": "250g", "onion": "1", "garlic cloves": "2", "canned tomatoes": "400g", "tomato paste": "2 tablespoons", "olive oil": "2 tablespoons", "dried oregano": "1 teaspoon", "dried basil": "1 teaspoon", "salt": "to taste", "black pepper": "to taste", "Parmesan cheese": "grated, for garnish" } |
| "Can you give me the ingredients for a chocolate cake?" | { "flour": "250g", "sugar": "200g", "cocoa powder": "50g", "baking powder": "2 teaspoons", "baking soda": "1 teaspoon", "salt": "1/2 teaspoon", "eggs": "2", "milk": "250ml", "vegetable oil": "125ml", "vanilla extract": "2 teaspoons", "boiling water": "250ml", "powdered sugar": "for dusting" } |
| "What are the ingredients for a chicken stir-fry?" | { "chicken breast": "400g", "vegetable oil": "2 tablespoons", "garlic cloves": "2", "ginger": "1 tablespoon", "onion": "1", "bell pepper": "1", "carrot": "1", "broccoli florets": "1 cup", "soy sauce": "2 tablespoons", "oyster sauce": "1 tablespoon", "sesame oil": "1 teaspoon", "cornstarch": "1 tablespoon", "salt": "to taste", "black pepper": "to taste" } |
| "I need the ingredients for a Greek salad." | { "lettuce": "1 head", "cucumber": "1", "tomatoes": "2", "red onion": "1/4", "kalamata olives": "1/2 cup", "feta cheese": "100g", "extra-virgin olive oil": "2 tablespoons", "lemon juice": "1 tablespoon", "dried oregano": "1 teaspoon", "salt": "to taste", "black pepper": "to taste" } |
| "Give me the ingredients for a vegetarian curry." | { "potatoes": "2", "carrots": "2", "cauliflower": "1/2 head", "green beans": "100g", "onion": "1", "garlic cloves": "2", "ginger": "1 tablespoon", "canned tomatoes": "400g", "coconut milk": "400ml", "curry powder": "2 tablespoons", "turmeric": "1 teaspoon", "cumin": "1 teaspoon", "coriander": "1 teaspoon", "chili powder": "1/2 teaspoon", "salt": "to taste", "vegetable oil": "2 tablespoons" } |
Text Input, Category Output Example
| text_input | category_label |
|---|---|
| "I love this movie!" | "positive" |
| "This restaurant has amazing food." | "positive" |
| "The customer service was terrible." | "negative" |
| "I'm not sure if I like it." | "neutral" |
| "The weather is perfect today." | "positive" |
Text Input, Binary Output Example
| text_input | binary_label |
|---|---|
| "I love this movie!" | 1 |
| "This restaurant has amazing food." | 1 |
| "The customer service was terrible." | 0 |
| "I'm not sure if I like it." | 0 |
| "The weather is perfect today." | 1 |
How to fine-tune your LLMs in Predibase
To train a new model through the UI, first click `+ New Model Repository`. You'll be prompted to specify a name for the model repository and optionally provide a description. We recommend using descriptive names that reflect the ML use case you're trying to solve.
Once you've entered this information, you'll be taken directly to the Model Builder to train your first model.
Model Builder Page 1
Train a single model version. You can start with our defaults to establish a baseline, or customize any parameter in the model pipeline, from preprocessing to training parameters.
Next you'll be asked to provide a model version description, connection (the system you want to pull data in from), dataset, and target (the output feature or column you want to predict). At the moment, Predibase only supports one text input feature and one output feature (either text or category) when fine-tuning large language models. Support for multiple input and output features for large language models is on our roadmap and will be in the platform soon.
Next, select the Large Language Model model type to fine-tune your LLM.
At the bottom of the model builder page, you will see a summary view of the features in your dataset. If your dataset has more than one text feature, disable the additional features using the `Active` toggle on the right side.
Model Builder Page 2
Now we'll cover the configuration options available after you click `Next` and land on the second page of the Model Builder.
Model Graph
Also known as the butterfly diagram, the model graph shows a visual diagram of the training/evaluation pipeline. Each individual component is viewable and modifiable, providing a truly "glass-box" approach to ML development.
Parameters
This section contains parameters that users may consider modifying while tweaking models.
Model Name:
To fine-tune an LLM for your task, you first need to choose which LLM you want to use. This can be configured using the `Model Name` parameter within the Large Language Model parameter sub-section. Hugging Face lists a large variety of large language models you can choose from. Make sure you provide the full model name by copying the value from the model card.
Typically, good choices here are base large language models, though instruction-tuned models can also be used. Instruction-tuned models are language models that have been fine-tuned with explicit instructions or prompts during the training process. These instructions give the model specific guidance on how to approach a task or generate desired outputs, so it can respond to instructions when prompted. Starting from an instruction-tuned model is typically useful when you have a small dataset for fine-tuning the large language model on your data.
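If you're unsure of a model's exact identifier, one way to look up full model names programmatically is the `huggingface_hub` client, sketched below; the search term is an example, and copying the name from the model card in the browser works just as well:

```python
from huggingface_hub import list_models

# Search the Hub for candidate base models, sorted by download count.
for model in list_models(search="llama", sort="downloads", direction=-1, limit=5):
    print(model.id)  # the full model name to paste into `Model Name`
```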
Task and Template
The next step after picking your model is to configure your prompt through the Task and Template parameters. These parameters define the task you want the LLM to perform. Predibase uses default templates if you don't provide one.
For example, if you are fine-tuning one of the LLaMA models from Meta AI on Stanford's Alpaca dataset (a dataset for instruction tuning), you may define your task and template in the following way:
Task:

```
Write a response that appropriately completes the request.
```

Template:

```
Below is an instruction that describes a task. {task}
### Instruction: {sample_input}
### Response:
```

`{sample_input}` inserts your input text feature into the template, while `{task}` inserts the task you defined into the template.
If a row of data in your dataset had the value `Generate a poem with 10 lines.`, then your data would be transformed to the following after preprocessing:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: Generate a poem with 10 lines.
### Response:
```
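The substitution behaves like ordinary string formatting; this sketch reproduces the preprocessing result above in plain Python:

```python
template = (
    "Below is an instruction that describes a task. {task}\n"
    "### Instruction: {sample_input}\n"
    "### Response:"
)
task = "Write a response that appropriately completes the request."
prompt = template.format(task=task, sample_input="Generate a poem with 10 lines.")
print(prompt)
```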
Predibase also lets you control other parameters for preprocessing such as the sequence length, lowercasing, etc. through Feature Type Defaults.
Generation:
The generation config refers to the set of configuration parameters that control the behavior of text generation models. These parameters specify various aspects of the generation process, including length restrictions, temperature, repetition penalty, and more. These are useful when evaluating your fine-tuned model on your validation and test sets.
Here are some common configuration parameters that can be specified in the generation config. You can control as many as you like and leave the default values we've set for the rest:
- Max Length: Sets the maximum number of tokens for the generated output, ensuring it doesn't exceed a certain length.
- Min Length: Specifies the minimum length of the generated output, ensuring it meets a certain threshold.
- Temperature: Controls the randomness of the generated text. Higher temperature values (e.g., 1.0) result in more diverse and creative outputs, while lower values (e.g., 0.2) lead to more deterministic and focused responses.
- Repetition Penalty: Discourages the model from repeating the same tokens in the generated text. Higher penalty values (e.g., 2.0) make the model more cautious about generating repetitive content.
- Num Beams: Specifies the number of beams used in beam search decoding. Increasing the number of beams lets the model explore more candidate sequences, which can improve output quality but also increases computation time.
By adjusting these generation configuration parameters, you can control the trade-off between creativity and coherence in the generated text, prevent common issues like repetition, and tailor the model's behavior according to the specific requirements of your application.
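As a point of reference, the analogous knobs in the open-source Hugging Face transformers library live in `GenerationConfig`; the values below are illustrative, not Predibase defaults:

```python
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_length=256,          # cap on the total number of generated tokens
    min_length=8,            # floor on output length
    temperature=0.7,         # applies when sampling; lower = more deterministic
    repetition_penalty=1.2,  # values > 1.0 discourage repeated tokens
    num_beams=4,             # candidate sequences kept during beam search
)
```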
Adapter:
When fine-tuning a model, you generally have two options for updating the weights:
- Fine-tuning all the weights: This approach involves updating all the weights of your pre-trained LLM during the fine-tuning process. Initially, the base model is trained on a large corpus of text for natural language processing. Then, for a specific task or domain, you continue training the model on a smaller dataset that is more relevant to the target task. During this fine-tuning phase, all the weights of the model are updated using the new dataset. This process allows the model to adapt to the specific characteristics and nuances of the target task.
- Using adapters for parameter-efficient fine-tuning: Adapters provide an alternative approach to fine-tuning that aims to minimize the number of trainable parameters and reduce the risk of catastrophic forgetting. Instead of updating all the weights of the model, adapters introduce a new, smaller set of task-specific parameters that are added to the pre-trained model. These task-specific adapters are attached to the pre-trained model's intermediate layers, allowing for fine-tuning on a specific task while keeping the majority of the original model's parameters frozen (i.e., unchanged). The adapter architecture is typically lightweight and modular, making it easier to add or remove adapters for different tasks without significant modifications to the pre-trained model's architecture.
The key advantage of using adapters is parameter efficiency. By introducing task-specific adapters, you can fine-tune the model for multiple tasks without modifying the original weights significantly. This approach reduces the computational cost and memory requirements compared to fine-tuning all the weights, especially when dealing with large-scale models. Additionally, using adapters helps mitigate the risk of catastrophic forgetting, as the original model's parameters remain mostly unchanged.
The choice between fine-tuning all the weights and using adapters depends on the specific requirements of your task, available computational resources, and the trade-off between model complexity and performance. Fine-tuning all the weights is more flexible and can potentially achieve better performance if you have sufficient resources, while using adapters provides a more parameter-efficient approach that is particularly beneficial when dealing with resource constraints.
If you want to fine-tune all the weights of the model, you can skip setting an adapter by setting the adapter type to `None`.
If you want to use an adapter for parameter efficient fine-tuning, Predibase supports a wide variety of strategies with even more options coming in the near future. Currently, Predibase supports the following adapter types:
- LoRA: This strategy adds pairs of rank-decomposition weight matrices (called update matrices) to existing weights, and only trains those newly added weights. These rank-decomposition weight matrices are much smaller in size than the original weight matrices. LoRA matrices are generally added to the attention layers of the original model.
- AdaLoRA: AdaLoRA builds upon the LoRA approach by adaptively allocating the parameter budget across layers based on their importance for the task: weight matrices that matter more for the task receive higher-rank updates, while less important ones are pruned. This adaptivity allows AdaLoRA to allocate resources more efficiently and potentially achieve better performance with the same parameter budget.
- Adaption Prompt: This method prepends tunable prompt tensors to the embedded inputs.
You can choose to configure any of the parameters for these strategies, or use the default values we've already defined for you.
We suggest using a parameter-efficient fine-tuning technique, like LoRA, for your first experiments fine-tuning for a task.
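Outside of Predibase, the same idea can be sketched with the open-source `peft` library; the rank, alpha, and target modules below are common but illustrative choices:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the update matrices
    lora_alpha=16,              # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["c_attn"],  # gpt2's attention projection layer
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small LoRA weights are trainable
```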
Trainer:
Since we want to fine-tune our base LLM on our data, we select the `finetune` trainer from the dropdown.
After this, you can configure a variety of training related parameters, such as learning rate, batch size, etc.
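For a rough idea of what these knobs look like in code, here is a sketch using transformers' `TrainingArguments`; the values are illustrative, not Predibase defaults:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finetune-output",   # illustrative path for checkpoints
    learning_rate=2e-5,             # step size for gradient updates
    per_device_train_batch_size=8,  # examples per gradient step per device
    num_train_epochs=3,             # passes over the fine-tuning dataset
)
```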
That's it! You can hit `Train` to start fine-tuning on your dataset!
Metrics
When you fine-tune your large language model for a text to text generation task, you will see the following metrics:
- Loss: The loss measures how well the generated text matches the desired target text, quantifying the discrepancy between the generated output and the ground truth. It is computed as the cross-entropy between the predicted output and the ground-truth output.
- Perplexity: Perplexity measures how well a language model predicts a given sequence of tokens; it gives a sense of how surprised or confused the model is when it encounters new text. It is calculated from the likelihood of the target sequence under the model's learned probability distribution, and equals the exponential of the cross-entropy loss (see the sketch after this list). Lower perplexity values indicate that the model is more certain and accurate in predicting the next token in a sequence.
- Token Accuracy: Token accuracy measures the percentage of correctly predicted tokens in the generated text compared to the ground truth. It calculates the ratio of correctly generated tokens to the total number of tokens. Token accuracy provides an understanding of how well the model generates individual tokens, irrespective of their position in the sequence.
- Sequence Accuracy: Sequence accuracy evaluates the correctness of the entire generated sequence compared to the ground truth. It measures the percentage of generated sequences that exactly match the desired target sequences. Sequence accuracy is particularly relevant when generating sequences that require strict adherence to a desired output format or structure.
- Character Error Rate: Character Error Rate (CER) is a metric commonly used to evaluate the quality of text generation models, such as speech recognition systems or optical character recognition (OCR) systems. It measures the percentage of incorrectly predicted characters compared to the ground truth. CER is calculated by aligning the predicted and ground truth sequences and counting the number of substitutions, insertions, and deletions required to transform one sequence into the other.
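The sketch below, referenced in the perplexity bullet, shows how loss, perplexity, token accuracy, and sequence accuracy relate, using dummy tensors in PyTorch:

```python
import math
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 100, 4, 8
logits = torch.randn(batch, seq_len, vocab_size)          # dummy model outputs
targets = torch.randint(0, vocab_size, (batch, seq_len))  # dummy ground truth

# Loss: cross-entropy between the predicted distribution and target tokens.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Perplexity: the exponential of the cross-entropy loss.
perplexity = math.exp(loss.item())

preds = logits.argmax(dim=-1)
token_accuracy = (preds == targets).float().mean().item()  # per-token match rate
sequence_accuracy = (preds == targets).all(dim=-1).float().mean().item()  # exact-match rate
print(loss.item(), perplexity, token_accuracy, sequence_accuracy)
```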
In an ideal scenario, your fine-tuned model performs well on your validation and test sets on the metrics above. You want to ensure that your model is not overfitting on your data during training, and that it starts to converge after a few epochs of fine-tuning.