Turbo LoRA ⚡
Introducing ... Faster inference with Turbo LoRA
Turbo LoRA is a new parameter-efficient fine-tuning method developed by Predibase that marries the benefits of LoRA fine-tuning with speculative decoding to predict several tokens into the future in a single step.
This proprietary fine-tuning method builds on LoRA while also improving inference throughput (measured in tokens generated per second) by up to 3.5x for single requests and up to 2x for batched workloads with high queries per second (QPS), depending on the downstream task.
Note that Turbo LoRA training jobs take longer and are priced at 2x the standard fine-tuning rate. (See pricing)
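Turbo LoRA itself is proprietary and its internals are not described here, but the general idea behind speculative decoding can be illustrated with a toy sketch. The example below is a generic greedy draft-and-verify loop, not Predibase's implementation: a cheap draft predictor proposes `k` tokens, the target model checks them, and the longest agreeing prefix is accepted along with one token from the target. All names (`speculative_decode`, `toy_target_next`, `toy_draft_next`) are hypothetical, and the "models" are simple integer functions standing in for real networks.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(next_token(ctx))
    return ctx[len(prompt):]


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding.

    Returns the generated tokens and the number of verification passes.
    In a real system each verification pass is a single batched forward
    pass of the target model; here it is simulated position by position.
    """
    tokens = list(prompt)
    generated: List[int] = []
    passes = 0
    while len(generated) < max_new_tokens:
        # 1. The draft predictor proposes k tokens autoregressively.
        draft: List[int] = []
        ctx = list(tokens)
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2. The target verifies the proposal and keeps the agreeing prefix.
        passes += 1
        accepted: List[int] = []
        ctx = list(tokens)
        for tok in draft:
            expected = target_next(ctx)
            if expected != tok:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Every draft token matched, so the target adds one more token.
            accepted.append(target_next(ctx))

        # 3. Never emit more tokens than requested.
        accepted = accepted[: max_new_tokens - len(generated)]
        tokens.extend(accepted)
        generated.extend(accepted)
    return generated, passes


def toy_target_next(ctx: List[int]) -> int:
    """Toy 'target model': always counts upward."""
    return ctx[-1] + 1


def toy_draft_next(ctx: List[int]) -> int:
    """Toy 'draft model': agrees with the target except after some tokens."""
    return 0 if ctx[-1] % 7 == 6 else ctx[-1] + 1


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target_next, toy_draft_next, [0], 20)
    print(f"{len(out)} tokens in {passes} verification passes")
```

Because mismatched draft tokens are always replaced by the target's own choice, the output is identical to plain greedy decoding with the target; the saving comes from needing fewer target passes than generated tokens. How large that saving is depends on how often the draft agrees with the target, which is consistent with the task-dependent speedups quoted above.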
How to train with Turbo LoRA
See which models currently support Turbo LoRA. In the FinetuningConfig, set adapter="turbo_lora".
from predibase import Predibase, FinetuningConfig

# Initialize the client (replace the placeholder with your API token)
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Connect a dataset
dataset = pb.datasets.from_file("/path/tldr_dataset.csv", name="tldr_dataset")

# Create an adapter repository
repo = pb.repos.create(
    name="news-summarizer-model",
    description="TLDR News Summarizer Experiments",
    exists_ok=True,
)

# Start a fine-tuning job with Turbo LoRA; this call blocks until training is finished
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="mistral-7b",
        adapter="turbo_lora",
    ),
    dataset=dataset,
    repo=repo,
)
When should I use Turbo LoRA?
Turbo LoRA is ideal for applications that need fast, efficient generation of long output sequences, such as document summarization, machine translation, question answering, code generation, and creative writing. It is not useful when fine-tuning for classification tasks, because output sequences in those tasks are typically short (fewer than 10 output tokens).
The throughput gains from Turbo LoRA adapters grow with the volume of fine-tuning data. Typically, a few thousand rows of data are a good starting point for seeing noticeable throughput improvements.
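As a rough way to apply the two rules of thumb above before starting a job, the sketch below summarizes a dataset's size and average output length. It is illustrative only and makes several assumptions: the helper name `turbo_lora_fit` is hypothetical, the output column is assumed to be called `completion`, whitespace-separated words are used as a crude stand-in for tokens (a real tokenizer will usually count more), and `min_rows=2000` is just one reading of "a few thousand rows".

```python
import csv
import io
from statistics import mean
from typing import Dict, Iterable


def turbo_lora_fit(
    rows: Iterable[Dict[str, str]],
    completion_column: str = "completion",  # assumed column name
    min_rows: int = 2000,                   # assumed reading of "a few thousand rows"
    short_output_tokens: int = 10,          # classification-style outputs fall below this
) -> Dict[str, object]:
    """Summarize whether a dataset matches the guidance for Turbo LoRA."""
    # Whitespace word count is only a rough proxy for the tokenizer's token count.
    lengths = [len(row[completion_column].split()) for row in rows]
    avg = mean(lengths) if lengths else 0.0
    return {
        "rows": len(lengths),
        "avg_output_tokens": avg,
        "long_outputs": avg >= short_output_tokens,
        "enough_rows": len(lengths) >= min_rows,
    }


if __name__ == "__main__":
    # Tiny in-memory CSV standing in for a real dataset file.
    sample = io.StringIO(
        "prompt,completion\n"
        "Classify the sentiment: great movie,positive\n"
        "Classify the sentiment: boring plot,negative\n"
    )
    print(turbo_lora_fit(csv.DictReader(sample)))
```

A dataset of one-word labels like the sample would report `long_outputs` as false, which matches the guidance that classification tasks gain little from Turbo LoRA.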