
FinetuningConfig

Below is the class definition for FinetuningConfig. It inherits from Pydantic's BaseModel.

from typing import List

from pydantic import BaseModel, Field, PositiveFloat, PositiveInt


class FinetuningConfig(BaseModel):
    base_model: str
    adapter: str | None = Field(default=None)
    task: str | None = Field(default=None)
    epochs: PositiveInt | None = Field(default=None)
    learning_rate: PositiveFloat | None = Field(default=None)
    rank: PositiveInt | None = Field(default=None)
    target_modules: List[str] | None = Field(default=None)
    enable_early_stopping: bool = Field(default=True)
    lr_scheduler: dict | None = Field(default=None)
    optimizer: dict | None = Field(default=None)
    lora_alpha: PositiveInt | None = Field(default=None)
    lora_dropout: PositiveFloat | None = Field(default=None)
    warmup_ratio: PositiveFloat | None = Field(default=None)
    effective_batch_size: PositiveInt | None = Field(default=None)

Predibase sets the following default values (subject to change):

  • epochs: 3
  • adapter: "lora"
  • task: "instruction_tuning"
  • learning_rate: 0.0002
  • rank: 16
  • target_modules: see below in the "Target Modules" section
  • enable_early_stopping: True (training stops early if the validation loss plateaus over 10 successive evaluation checkpoints)
  • lr_scheduler: {"type": "cosine_with_restarts", "params": {}}
  • optimizer: {"type": "adamw", "params": {}}
  • lora_alpha: None
  • lora_dropout: 0
  • warmup_ratio: 0.03
  • effective_batch_size: 16
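
For illustration, a config that overrides a few of these defaults might look like the sketch below. The from predibase import FinetuningConfig import path is an assumption; adapt it to however your SDK version exposes the class.

# Minimal sketch: the import path is an assumption; adjust to your SDK version.
from predibase import FinetuningConfig

config = FinetuningConfig(
    base_model="mistral-7b-instruct-v0-2",  # short name from the available models list
    epochs=5,            # overrides the default of 3
    learning_rate=1e-4,  # overrides the default of 0.0002
    rank=8,              # overrides the default of 16
)
# Fields left unset (adapter, task, target_modules, ...) fall back to the defaults above.
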
Note regarding base_model

Use the short names provided in the list of available models.

Task

The task parameter determines what the adapter will be fine-tuned to do. The following tasks are supported:

  • instruction_tuning: Given a prompt input field, the adapter is trained to generate a completion output field.
  • completion (Beta): Given a single text input field, the adapter is trained to do next token prediction on the input.
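
For example, to fine-tune for raw next-token prediction instead of instruction tuning, set the task field explicitly (a minimal sketch; the import path is assumed):

from predibase import FinetuningConfig  # assumed import path

# Sketch: completion-style fine-tuning on a single text field.
config = FinetuningConfig(
    base_model="llama-3-8b",
    task="completion",  # default is "instruction_tuning"
)
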

Adapter

The adapter parameter determines what type of adapter will be fine-tuned. Predibase supports three adapter types:

  • lora: LoRA is an efficient method for training LLMs that introduces a small set of task-specific parameters alongside the original model parameters. These weights learn from the task-specific data during fine-tuning, optimizing performance for your dataset.
  • turbo_lora (New): This is a proprietary fine-tuning method that builds on LoRA while also improving inference throughput (measured in tokens generated per second) by up to 3.5x for single requests and up to 2x for high queries-per-second batched workloads, depending on the downstream task. Training jobs take longer and are priced at 2x the standard fine-tuning pricing.
  • turbo: Fine-tunes a speculator on top of a base model or existing LoRA adapter to enable speculative decoding, improving inference throughput by 2x–3x while being 100x more parameter-efficient. It is ideal for speeding up token generation tasks; you can also convert an existing LoRA fine-tune into a Turbo LoRA by resuming the job with adapter="turbo".

Read more about each adapter here.
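
As a sketch, selecting a Turbo LoRA fine-tune (rather than the default LoRA) only requires changing the adapter field; the import path is assumed:

from predibase import FinetuningConfig  # assumed import path

# Sketch: train a Turbo LoRA adapter instead of the default LoRA adapter.
config = FinetuningConfig(
    base_model="mistral-7b-instruct-v0-2",
    adapter="turbo_lora",  # or "turbo" to train only a speculator
)
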

Target Modules

The target_modules parameter is a list of strings, where each string is the name of a module in the model that you want to fine-tune. The default value is None, which means that the default modules (marked below) will be fine-tuned. Per base model, we support the following target modules (an example config follows the list):

  • codellama-13b-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • codellama-70b-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • gemma-2b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • gemma-2b-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • gemma-7b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • gemma-7b-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-2-13b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-2-13b-chat
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-2-70b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-2-70b-chat
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-2-7b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-2-7b-chat
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-3-8b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-3-8b-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-3-70b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • llama-3-70b-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • mistral-7b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • mistral-7b-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • mistral-7b-instruct-v0-2
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • mistral-7b-instruct-v0-3
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • mixtral-8x7b-instruct-v0-1
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
  • phi-2
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • dense
    • fc1 (default)
    • fc2 (default)
  • zephyr-7b-beta
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • phi-3-mini-4k-instruct
    • qkv_proj (default)
    • o_proj (default)
    • gate_up_proj
    • down_proj
  • phi-3-5-mini-instruct
    • qkv_proj (default)
    • o_proj (default)
    • gate_up_proj
    • down_proj
  • codellama-7b-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • codellama-7b
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
  • solar-pro-preview-instruct
    • q_proj (default)
    • k_proj
    • v_proj (default)
    • o_proj
    • gate_proj
    • up_proj
    • down_proj
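
For example, to fine-tune all attention and MLP projections of a Llama-style model instead of only the default q_proj/v_proj pair, list the modules explicitly (a sketch; module names must come from the chosen base model's list above, and the import path is assumed):

from predibase import FinetuningConfig  # assumed import path

# Sketch: target every attention and MLP projection for a llama-3-8b run.
config = FinetuningConfig(
    base_model="llama-3-8b",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
)
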

Optimizer

The optimizer parameter is a dictionary that specifies the optimizer to use. The default value is {"type": "adamw", "params": {}}.

In particular, the params field is a dictionary that specifies the parameters to use for the optimizer, and the defaults are different for each optimizer type. See the Hugging Face documentation for more information.

We currently support the following optimizers:

  • adamw_torch
  • adamw_torch_fused
  • sgd
  • adagrad
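
For example, switching from the default AdamW to SGD is a one-line change (a sketch; which keys are accepted inside params depends on the underlying Hugging Face optimizer, and the import path is assumed):

from predibase import FinetuningConfig  # assumed import path

# Sketch: use SGD instead of the default AdamW optimizer.
config = FinetuningConfig(
    base_model="llama-3-8b",
    optimizer={"type": "sgd", "params": {}},  # params keys follow the Hugging Face optimizer
)
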

Learning Rate Scheduler

The lr_scheduler parameter is a dictionary that specifies the learning rate scheduler to use. The default value is {"type": "cosine_with_restarts", "params": {}}.

In particular, the params field is a dictionary that specifies the parameters to use for the learning rate scheduler, and the defaults are different for each learning rate scheduler type. See the Hugging Face documentation for more information.

We currently support the following learning rate schedulers:

  • constant_with_warmup
  • linear
  • cosine
  • cosine_with_restarts
  • inverse_sqrt
  • warmup_stable_decay
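
For example, a plain cosine schedule without restarts can be selected like this (a sketch; accepted params keys depend on the underlying Hugging Face scheduler, and the import path is assumed):

from predibase import FinetuningConfig  # assumed import path

# Sketch: use a plain cosine schedule instead of cosine with restarts.
config = FinetuningConfig(
    base_model="llama-3-8b",
    lr_scheduler={"type": "cosine", "params": {}},  # params keys follow the Hugging Face scheduler
)
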

LoRA Alpha

The lora_alpha parameter is a positive integer that specifies the scaling factor for the LoRA weights. This parameter is only used if the adapter type is lora or turbo_lora. The default value is None.

LoRA Dropout

The lora_dropout parameter is a positive float that specifies the dropout rate for the LoRA adapter, in the range [0, 1]. Dropout is a regularization technique that randomly drops out some of the neurons in the network during training to prevent overfitting. This parameter is only used if the adapter type is lora or turbo_lora. The default value is 0.

Warmup Ratio

The warmup_ratio parameter is a positive float that specifies the warmup ratio for the learning rate scheduler. The default value is 0.03.
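
Putting the last three knobs together, a LoRA run that adjusts the scaling factor, dropout, and warmup might look like the sketch below (values are illustrative only; the import path is assumed):

from predibase import FinetuningConfig  # assumed import path

# Sketch: tune the LoRA scaling factor, dropout, and LR warmup together.
config = FinetuningConfig(
    base_model="mistral-7b",
    rank=16,
    lora_alpha=32,      # scaling factor for the LoRA weights
    lora_dropout=0.05,  # dropout on the LoRA layers, in [0, 1]
    warmup_ratio=0.1,   # fraction of training used for LR warmup
)
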

Effective Batch Size

Typically, we pick the highest batch size that fits in memory and then adjust gradient_accumulation_steps to simulate a larger batch size, giving the effective batch size (batch_size * gradient_accumulation_steps). Setting effective_batch_size as a parameter overrides that process by directly setting the effective batch size. effective_batch_size must be a power of 2. The default value is 16.
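
For instance, pinning the effective batch size to 32 (a power of 2) looks like the sketch below; the power-of-two assertion is just an illustrative check, and the import path is assumed:

from predibase import FinetuningConfig  # assumed import path

# Sketch: set the effective batch size directly instead of letting it be derived.
ebs = 32
assert ebs > 0 and (ebs & (ebs - 1)) == 0, "effective_batch_size must be a power of 2"

config = FinetuningConfig(
    base_model="llama-3-8b",
    effective_batch_size=ebs,  # batch_size * gradient_accumulation_steps
)
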