Deep dive into parameter-efficient fine-tuning with adapters
Adapters are a powerful, parameter-efficient way to fine-tune large language models for
specific tasks: they customize a base model with your data without changing the original
model architecture. They enable faster training, reduce the risk of overfitting, and
allow for rapid experimentation.
Unlike traditional fine-tuning, which updates all model weights, adapters introduce a
small number of task-specific parameters while keeping the base model frozen; a minimal
code sketch of this idea follows the list below. This approach offers several key benefits:
Efficiency: Training is faster and requires fewer compute resources
Reduced Risk: By updating fewer parameters, adapters reduce the risk of
catastrophic forgetting
Flexibility: Multiple adapters can be trained and deployed on a single
base model
Storage: Adapter checkpoints are much smaller than full model checkpoints
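To make this concrete, here is a minimal sketch of the freeze-the-base, train-the-adapter
workflow using the open-source Hugging Face peft library. The model name and hyperparameter
values are illustrative assumptions rather than recommended settings, and this is not
Predibase's training stack.

    # Minimal sketch: wrap a frozen base model with a small LoRA adapter using
    # the Hugging Face `peft` library. Model name and hyperparameters are
    # illustrative placeholders, not recommendations.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

    config = LoraConfig(
        r=16,                                 # rank of the low-rank matrices
        lora_alpha=32,                        # scaling applied to the adapter output
        target_modules=["q_proj", "v_proj"],  # which linear layers get adapters
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, config)  # base weights are frozen automatically
    model.print_trainable_parameters()    # reports the small trainable fraction

Only the adapter weights receive gradients during training, which is what keeps checkpoints
small and lets multiple adapters be swapped on top of a shared base model.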
LoRA introduces trainable low-rank matrices
into the model’s layers to capture task-specific adjustments. Here’s how it
works:
Architecture: LoRA inserts two matrices (A and B) into each target linear layer
within each transformer layer (attention, MLP, or both), as sketched in the example below:
Matrix A projects the input from the original hidden size down to a lower dimension (called the rank)
Matrix B projects it back up to the original output size
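As an illustration of that forward pass, a LoRA-augmented linear layer can be written in
PyTorch roughly as follows. This is a simplified sketch of the standard LoRA formulation,
not Predibase's implementation; the default rank and alpha values are arbitrary.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

        def __init__(self, base_linear: nn.Linear, rank: int = 16, alpha: int = 32):
            super().__init__()
            self.base = base_linear
            for p in self.base.parameters():
                p.requires_grad_(False)                    # keep the original weights frozen
            in_dim, out_dim = base_linear.in_features, base_linear.out_features
            self.A = nn.Linear(in_dim, rank, bias=False)   # hidden size -> rank
            self.B = nn.Linear(rank, out_dim, bias=False)  # rank -> hidden size
            nn.init.zeros_(self.B.weight)                  # adapter starts as a no-op
            self.scaling = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Original output plus the scaled low-rank correction B(A(x)).
            return self.base(x) + self.scaling * self.B(self.A(x))

Because B is initialized to zero, the wrapped layer initially behaves exactly like the
frozen base layer, and training only moves the low-rank correction.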
Parameter Efficiency: the trainable LoRA parameters typically amount to only 0.1% to 1% of
the base model's parameters, depending on the rank and the number of layers targeted.
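A back-of-the-envelope count (with assumed layer sizes, not measurements) shows where that
percentage comes from:

    # Rough count for one 4096x4096 attention projection adapted with rank 16.
    hidden, rank = 4096, 16
    full_params = hidden * hidden       # 16,777,216 frozen weights in the layer
    lora_params = 2 * hidden * rank     # A (rank x hidden) + B (hidden x rank) = 131,072
    print(f"{lora_params} LoRA params, {lora_params / full_params:.2%} of the layer")
    # -> 131072 LoRA params, 0.78% of the layer; targeting fewer layers or using a
    #    lower rank pushes the overall fraction toward the 0.1% end of the range.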
Turbo LoRA is a proprietary method developed by Predibase that marries the benefits
of LoRA fine-tuning (for quality) with speculative decoding (for speed) to enhance
inference throughput (measured in tokens generated per second) by up to 3.5x for
single requests and up to 2x for high-queries-per-second batched workloads, depending
on the downstream task. Instead of predicting just one token at a time, speculative
decoding lets the model predict and verify several tokens into the future in a single
decoding step, which significantly accelerates generation and makes it well suited for
tasks involving long output sequences.
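The draft-and-verify loop can be sketched as follows. This is a schematic of greedy
speculative decoding assuming Hugging Face-style causal LMs and batch size 1; it is not
the Turbo LoRA implementation, and production systems use a rejection-sampling acceptance
rule rather than exact-match verification.

    import torch

    @torch.no_grad()
    def speculative_decode_step(target_model, draft_model, input_ids, k=4):
        """One greedy draft-and-verify step (schematic; assumes batch size 1)."""
        # 1. Draft: a cheap model proposes k future tokens autoregressively.
        draft_ids = input_ids
        for _ in range(k):
            logits = draft_model(draft_ids).logits[:, -1, :]
            draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
        proposed = draft_ids[:, input_ids.shape[1]:]         # the k proposed tokens

        # 2. Verify: a single forward pass of the target model scores all k
        #    proposed positions at once instead of one pass per token.
        logits = target_model(draft_ids).logits
        start = input_ids.shape[1] - 1
        target_preds = logits[:, start:-1, :].argmax(-1)     # target's own choices

        # 3. Accept the longest prefix where draft and target agree, then append
        #    one token from the target, so each step emits 1 to k+1 tokens.
        matches = (target_preds == proposed)[0].long()
        n_accept = int(matches.cumprod(0).sum())
        next_token = logits[:, start + n_accept, :].argmax(-1, keepdim=True)
        return torch.cat([input_ids, proposed[:, :n_accept], next_token], dim=-1)

Because the verification pass is parallel across positions, every accepted draft token
saves one sequential decoding step, which is where the throughput gain comes from.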
Turbo is a specialized adapter focused purely on inference speed through
speculative decoding. It’s based on the
Medusa architecture but with 10x fewer
parameters. It trains additional layers to predict multiple tokens in parallel, making it
useful when you want to accelerate an existing base model or an already fine-tuned
LoRA adapter. Note that Turbo adapters don’t train a LoRA, so none of the LoRA-specific
parameters such as rank, alpha, dropout, and target modules apply. For a hands-on example
of how to train a Turbo adapter, see this Colab notebook.
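Conceptually, a Medusa-style speculative head is a small trainable block per future
position sitting on top of the base model's final hidden states. The sketch below
illustrates that idea only; it is not Predibase's Turbo implementation, and the layer
sizes and head count are assumptions.

    import torch
    import torch.nn as nn

    class MultiTokenHeads(nn.Module):
        """Medusa-style draft heads (illustrative): head i predicts the token at
        offset i+1 beyond the next token produced by the base model's own head."""

        def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
            super().__init__()
            self.heads = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(hidden_size, hidden_size),  # small trainable layer per head
                    nn.SiLU(),
                    nn.Linear(hidden_size, vocab_size),   # logits for that future position
                )
                for _ in range(num_heads)
            ])

        def forward(self, last_hidden_state: torch.Tensor) -> list[torch.Tensor]:
            # last_hidden_state: (batch, seq_len, hidden_size) from the frozen base model.
            # The proposed future tokens are then verified in a single pass, as in
            # the speculative decoding sketch above.
            return [head(last_hidden_state) for head in self.heads]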