Adapters
Parameter-efficient fine-tuning with adapters
Adapters are a parameter-efficient way to fine-tune your model on a new task. Instead of updating all the weights in the model, you can add a small number of task-specific parameters to the model.
Predibase provides different adapter types that can be used to improve model performance on a specific task (LoRA), speed up model throughput with speculative decoding (Turbo), or both (Turbo LoRA).
Create an Adapter
Creating a fine-tuned adapter in Predibase requires three things:
- A dataset that contains examples of the task you want to fine-tune the model on.
- An adapter repository to store and track your adapter versions (experiments).
- A configuration object that specifies the hyperparameters for the fine-tuning job.
Python SDK
Fine-tuning jobs can be started using the Predibase Python SDK:
Synchronous Fine-Tuning API
Once the training job is created, it typically waits in a queue for 3-5 minutes until compute resources become available and are assigned to the job. Training then begins, and the training logs are streamed to your console.
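Below is a minimal sketch of a blocking fine-tuning run. The API token, dataset path, repository name, base model slug, and hyperparameter values are placeholders, and the FinetuningConfig fields shown are illustrative; check them against the SDK reference for your version.

```python
from predibase import Predibase, FinetuningConfig

# Connect with your Predibase API token (placeholder value).
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# 1. Upload a dataset containing examples of your task.
dataset = pb.datasets.from_file("/path/to/train.csv", name="my_task_dataset")

# 2. Create (or fetch) an adapter repository to track versions.
repo = pb.repos.create(
    name="my-task-adapter",
    description="Adapters for my task",
    exists_ok=True,
)

# 3. Kick off fine-tuning. This call blocks until training finishes
#    and streams the training logs to your console.
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="llama-3-1-8b-instruct",  # placeholder base model name
        epochs=3,                            # illustrative hyperparameters
        learning_rate=0.0002,
    ),
    dataset=dataset,
    repo=repo,
    description="initial fine-tuning run",
)
print(adapter)
```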
Asynchronous Fine-Tuning API
By default, creating an adapter is a blocking (synchronous) call. To create an adapter asynchronously, you can use the Jobs API:
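A hedged sketch of the asynchronous flow is shown below. The pb.finetuning.jobs method names and arguments are assumptions based on the Jobs API described above and should be verified against the SDK reference before use.

```python
from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Assumed Jobs API entry point: create the job without blocking on completion.
job = pb.finetuning.jobs.create(
    config=FinetuningConfig(base_model="llama-3-1-8b-instruct"),  # placeholder model
    dataset="my_task_dataset",   # dataset uploaded earlier
    repo="my-task-adapter",      # existing adapter repository
    watch=False,                 # return immediately instead of streaming logs
)

# Later, poll the job for status (assumed helper method).
print(pb.finetuning.jobs.get(job.uuid))
```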
Web UI
In the Predibase UI, you can start a fine-tuning job by navigating to the Adapters tab, creating a repository, and then creating a new version within the repository.
The UI supports most, but not all, of the parameters available within the Python SDK. For more advanced use cases, we recommend using the Python SDK.
For information about the different types of fine-tuning tasks supported (instruction fine-tuning, text completion, chat), see our Tasks guide.
Adapter Types
LoRA
LoRA (Low-Rank Adaptation) is the default adapter type that improves model quality and alignment with your task. It trains a small set of low-rank matrices while keeping the base model frozen, making it highly efficient.
To start fine-tuning with LoRA:
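A minimal sketch is shown below; since LoRA is the default adapter type, the explicit adapter argument is optional, and the rank value, base model slug, and dataset/repo names are illustrative placeholders.

```python
from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# LoRA is the default adapter type; the rank value here is illustrative.
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="llama-3-1-8b-instruct",  # placeholder base model
        adapter="lora",
        rank=16,
    ),
    dataset="my_task_dataset",   # previously uploaded dataset
    repo="my-task-adapter",      # existing adapter repository
    description="LoRA fine-tuning run",
)
```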
Turbo LoRA
Turbo LoRA is a proprietary method developed by Predibase that combines the benefits of LoRA fine-tuning (for quality) with speculative decoding (for speed). It can increase inference throughput (measured in tokens generated per second) by up to 3.5x for single requests and up to 2x for high queries-per-second batched workloads, depending on the downstream task. Instead of predicting just one token at a time, speculative decoding lets the model predict and verify several future tokens in a single decoding step. This significantly accelerates generation, making Turbo LoRA well-suited for tasks with long output sequences.
To start fine-tuning with Turbo LoRA:
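A minimal sketch is shown below; the adapter value "turbo_lora" follows the adapter types described above, and the base model slug and dataset/repo names are placeholders.

```python
from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# adapter="turbo_lora" trains LoRA weights plus the speculative decoding layers.
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="llama-3-1-8b-instruct",  # placeholder base model
        adapter="turbo_lora",
    ),
    dataset="my_task_dataset",
    repo="my-task-adapter",
    description="Turbo LoRA fine-tuning run",
)
```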
Turbo
Turbo is a speculative decoding adapter that speeds up inference without changing the model’s outputs. It trains additional layers to predict multiple tokens in parallel, making it useful when you want to accelerate an existing base model or an already fine-tuned LoRA adapter. Note that Turbo adapters don’t train a LoRA, so none of the LoRA-specific parameters like rank, alpha, dropout, and target modules apply.
To start fine-tuning with Turbo:
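A minimal sketch is shown below; the adapter value "turbo" follows the adapter types described above, and the base model slug and dataset/repo names are placeholders.

```python
from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# adapter="turbo" trains only the speculative decoding layers; LoRA-specific
# parameters (rank, alpha, dropout, target modules) do not apply here.
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="llama-3-1-8b-instruct",  # placeholder base model
        adapter="turbo",
    ),
    dataset="my_task_dataset",
    repo="my-task-adapter",
    description="Turbo adapter training run",
)
```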
See Continue Training for how to continue training from an existing LoRA adapter.
When fine-tuning with any of these adapter types, you can use the apply_chat_template flag to automatically format your prompts with the base model’s chat template. This is particularly useful when fine-tuning instruction-tuned models.
Chat Templates
Open source models come in base (e.g., Llama-3-8B) and instruct (e.g., Llama-3-8B-Instruct) versions. Instruct versions are trained with chat templates that provide consistent instruction formatting. Using model-specific chat templates typically improves fine-tuning performance, especially when fine-tuning instruction-tuned models.
Using Chat Templates
For fine-tuning, apply_chat_template is supported in the SFTConfig:
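A minimal sketch is shown below; the base model slug and dataset/repo names are placeholders, and any SFTConfig fields beyond apply_chat_template should be checked against the SDK reference.

```python
from predibase import Predibase, SFTConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

adapter = pb.adapters.create(
    config=SFTConfig(
        base_model="llama-3-1-8b-instruct",  # placeholder instruct model
        apply_chat_template=True,            # wrap each sample in the model's chat template
    ),
    dataset="my_task_dataset",
    repo="my-task-adapter",
    description="chat-template fine-tuning run",
)
```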
When this parameter is set to True, each training sample in the dataset will automatically have the model’s chat template applied to it. Note that this parameter is only supported for instruction and chat fine-tuning, not continued pretraining.
Inference with a Chat Template
If your model was trained with apply_chat_template set to True, please use only the OpenAI-compatible method to query the model, because the chat template will automatically be applied to your inputs. You can see sample code in the Python SDK example.
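As an illustration, a hedged sketch of OpenAI-compatible querying is shown below. The base URL pattern, tenant ID, deployment name, and adapter identifier are placeholders; copy the actual endpoint details from your Predibase deployment.

```python
from openai import OpenAI

# Point the OpenAI client at your Predibase deployment's OpenAI-compatible
# endpoint. The URL below is a placeholder pattern -- use the actual base URL
# shown in your deployment details.
client = OpenAI(
    api_key="<PREDIBASE_API_TOKEN>",
    base_url="https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1",
)

# Pass the adapter as the model name ("repo-name/version"); the chat template
# is applied automatically to the messages.
completion = client.chat.completions.create(
    model="my-task-adapter/1",
    messages=[{"role": "user", "content": "Summarize the ticket below..."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```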
Next Steps
- Learn how to continue training from existing adapters or checkpoints