Turbo LoRA ⚡

Adapters are a powerful way to fine-tune large language models for specific tasks by introducing parameter-efficient methods that customize base models using your data without changing the original model architecture. They enable faster training, reduce the risk of overfitting, and allow for rapid experimentation.

Turbo LoRA is a proprietary fine-tuning method that combines LoRA for quality and speculative decoding for speed, achieving up to 3.5x faster inference for single requests and up to 2x for high-query-batch workloads.

At Predibase, we support three adapter types:

  • LoRA
  • Turbo LoRA
  • Turbo

Adapter Types

LoRA

LoRA (Low-Rank Adaptation) introduces small trainable low-rank matrices into the model's layers. Instead of updating the entire set of model weights, LoRA inserts these matrices, known as LoRA A and LoRA B, to capture task-specific adjustments. LoRA A projects the original weight space down to a lower dimension, and LoRA B expands it back to the original size, enabling efficient fine-tuning with significantly fewer trainable parameters, usually between 0.1% and 1% of the base model's parameters. The result is faster fine-tuning while preserving the original model's capabilities.
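
To make the mechanics concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. It illustrates the general technique, not Predibase's internal implementation; the class name, rank, and alpha values are arbitrary choices for this example.

import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds trainable low-rank matrices A and B."""
    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # the original weights stay frozen
        # LoRA A projects down to the low-rank space; LoRA B projects back up
        self.lora_A = nn.Linear(base_linear.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # start as a no-op so training begins at W
        self.scaling = alpha / rank

    def forward(self, x):
        # y = W x + (alpha / r) * B(A(x)) -- only A and B receive gradients
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling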

Predibase supports LoRA adapters for all base models.

How to train with LoRA

from predibase import Predibase, FinetuningConfig

# Initialize the Predibase client with your API token
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Connect a dataset
dataset = pb.datasets.from_file("/path/tldr_dataset.csv", name="tldr_dataset")

# Create an adapter repository
repo = pb.repos.create(name="news-summarizer-model", description="TLDR News Summarizer Experiments", exists_ok=True)

# Start a fine-tuning job with LoRA
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="mistral-7b",
        adapter="lora",
    ),
    dataset=dataset,
    repo=repo,
)

Turbo LoRA

Turbo LoRA is a proprietary parameter-efficient fine-tuning method developed by Predibase that marries the benefits of LoRA fine-tuning (for quality) with speculative decoding (for speed). Depending on the downstream task, it boosts inference throughput (measured in tokens generated per second) by up to 3.5x for single requests and up to 2x for high-QPS batched workloads.

Instead of predicting just one token at a time, speculative decoding lets the model predict and verify several future tokens in a single decoding step, significantly accelerating generation and making it well suited to tasks with long output sequences.
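
As a rough illustration of the idea (greatly simplified to greedy decoding, and not Predibase's implementation), the loop below drafts k tokens with a small model and has the target model verify them, accepting the longest matching prefix. Here draft_next_token and target_next_token are hypothetical stand-ins for real model forward passes.

def speculative_decode(target_next_token, draft_next_token, prompt, k=4, max_new_tokens=32):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft k tokens cheaply with the small model.
        draft = []
        for _ in range(k):
            draft.append(draft_next_token(tokens + draft))

        # 2) Verify: check each drafted token against the target model's greedy
        #    choice (in a real system this is one batched forward pass).
        accepted = []
        for i in range(k):
            expected = target_next_token(tokens + accepted)
            if draft[i] == expected:
                accepted.append(draft[i])    # draft matches: keep it and continue
            else:
                accepted.append(expected)    # first mismatch: take the target's
                break                        # token and stop accepting drafts
        tokens.extend(accepted)

    return tokens[: len(prompt) + max_new_tokens]

Each iteration still produces at least one token the target model itself would have chosen, so the output matches plain greedy decoding while often emitting several tokens per step.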

Note that Turbo LoRA training jobs take longer and are priced at 2x the standard fine-tuning pricing (see pricing). Read our blog to learn more.

See which models currently support Turbo LoRA.

How to train with Turbo LoRA

In the FinetuningConfig, set adapter="turbo_lora".

# Connect a dataset
dataset = pb.datasets.from_file("/path/tldr_dataset.csv", name="tldr_dataset")

# Create an adapter repository
repo = pb.repos.create(name="news-summarizer-model", description="TLDR News Summarizer Experiments", exists_ok=True)

# Start a fine-tuning job with Turbo LoRA
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="mistral-7b",
        adapter="turbo_lora",
    ),
    dataset=dataset,
    repo=repo,
)

How to serve a Turbo LoRA Adapter

While a Turbo LoRA adapter can be trained on any base model, some models require additional deployment configuration to properly support adapter inference.

If the fine-tuned base model does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints). See the model requirements here.
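
For example, running inference with the trained adapter might look like the snippet below, assuming the standard Predibase Python client pattern; the deployment name, adapter version, and prompt are placeholders for this example.

# Query a deployment with the Turbo LoRA adapter attached
client = pb.deployments.client("mistral-7b")
response = client.generate(
    "Summarize the following article: ...",
    adapter_id="news-summarizer-model/1",  # <repository name>/<adapter version>
    max_new_tokens=128,
)
print(response.generated_text)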

When should I use Turbo LoRA?

Turbo LoRA is ideal for applications requiring rapid and efficient generation of long output sequences, such as document summarization, machine translation, question answering systems, code generation, creative writing, etc. Turbo LoRA is not useful when fine-tuning for classification tasks because output sequences in classification tasks are typically short (< 5 output tokens).

The throughput gains from a Turbo LoRA adapter grow with the volume of fine-tuning data. Typically, a few thousand rows of data is a good starting point for very noticeable throughput improvements.

Turbo (New)

The turbo adapter is used to fine-tune a speculator directly on top of the base model, creating a highly parameter-efficient adapter optimized for speculative decoding to enhance inference throughput (measured in tokens generated per second). It improves your original model's inference speed by 2x to 3x without affecting its output.

We designed our turbo adapter to be 100x more parameter-efficient than other fine-tuning techniques for speculative decoding, making it the ideal choice for accelerating token generation tasks.

The turbo adapter can be used to:

  1. Improve Base Model Inference Speed: Fine-tune a speculator to optimize your base model's performance on specific prompts and tasks, resulting in faster generation without compromising the original model's output quality. This approach tailors the speculative decoding process to your use case, further enhancing throughput gains.
  2. Faster LoRA Inference: If you’ve already fine-tuned a model using LoRA, you can convert it into a Turbo LoRA by resuming the fine-tuning job with the adapter type set to turbo. This allows you to leverage speculative decoding on top of your existing fine-tuned model, boosting inference speed without affecting your LoRA adapter's quality.

How to train a Turbo Adapter for Base Model Throughput Improvement

# Connect a dataset
dataset = pb.datasets.from_file("/path/tldr_dataset.csv", name="tldr_dataset")

# Create an adapter repository
repo = pb.repos.create(name="news-summarizer-model", description="TLDR News Summarizer Experiments", exists_ok=True)

# Start a fine-tuning job with a Turbo adapter on the base model
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="mistral-7b",
        adapter="turbo",
    ),
    dataset=dataset,
    repo=repo,
)

How to train a Turbo Adapter for Faster LoRA Inference

# Connect a dataset
dataset = pb.datasets.from_file("/path/tldr_dataset.csv", name="tldr_dataset")

# Create an adapter repository
repo = pb.repos.create(name="news-summarizer-model", description="TLDR News Summarizer Experiments", exists_ok=True)

# Resume fine-tuning from an existing LoRA adapter version with a Turbo adapter
adapter = pb.adapters.create(
    config=FinetuningConfig(
        adapter="turbo",
    ),
    continue_from_version="myrepo/3",  # existing LoRA adapter version to build on
    dataset=dataset,
    repo=repo,
)

How to serve a Turbo Adapter

While a Turbo adapter can be trained on any base model, some models require additional deployment configuration to properly support adapter inference.

If the fine-tuned base model does not require the adapter to be pre-loaded, you can use your Turbo adapter as normal (via private deployments and shared endpoints). See the model requirements here.