Configuration options for GRPO reinforcement learning
The GRPOConfig class extends SFTConfig with additional parameters for Group Relative Policy Optimization (GRPO) reinforcement learning. GRPO is used to fine-tune models based on reward functions.
Parameters inherited from SFTConfig:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| base_model | string | Yes | - | The base model to fine-tune |
| adapter | string | No | lora | The type of adapter to use. One of lora, turbo_lora, turbo |
| train_steps | integer | No | - | Number of training steps (overrides epochs if set) |
| learning_rate | float | No | 1e-5 | Learning rate for training |
| enable_early_stopping | boolean | No | true | Whether to enable early stopping |
| lr_scheduler | object | No | linear | Learning rate scheduler configuration |
| optimizer | object | No | paged_adamw_8bit | Optimizer configuration |
| warmup_ratio | float | No | 0.03 | Ratio of training steps to use for warmup |
| effective_batch_size | integer | No | 16 | Effective batch size for training |
| apply_chat_template | boolean | No | false | Whether to apply the chat template |
Parameters specific to GRPO:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| beta | float | No | 0.001 | Beta parameter for GRPO (value between 0 and 1) |
| num_generations | integer | No | 16 | Number of generations per prompt |
| sampling_params | object | No | See SamplingParamsConfig | Configuration for sampling parameters |
| reward_fns | object | No | See RewardFunctionsConfig | Configuration for reward functions |
LoRA adapter parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| rank | integer | No | 64 | Rank of the LoRA adapter |
| target_modules | array[string] | No | All linear layers | List of model modules to fine-tune |
| lora_alpha | integer | No | 64 | Alpha parameter for LoRA |
| lora_dropout | float | No | 0.05 | Dropout rate for LoRA |
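As an illustrative sketch only: the field names below come from the tables above, but the import path and the assumption that GRPOConfig accepts them as plain keyword arguments are not confirmed by this page.

```python
# Hypothetical import path; the real module layout depends on your SDK version.
from my_sdk import GRPOConfig  # assumption: fields are passed as keyword arguments

# Minimal sketch: a few SFTConfig-inherited fields plus the GRPO-specific knobs.
config = GRPOConfig(
    base_model="my-base-model",   # required; placeholder model name
    adapter="lora",               # one of lora, turbo_lora, turbo
    learning_rate=1e-5,
    effective_batch_size=16,
    beta=0.001,                   # GRPO beta, a value between 0 and 1
    num_generations=16,           # completions sampled per prompt
    rank=32,                      # LoRA rank, overriding the default of 64
)
```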
The SamplingParamsConfig class defines the parameters used for generating completions during the GRPO training process.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| temperature | float | No | 0.9 | Sampling temperature |
| top_p | float | No | 1.0 | Top-p sampling parameter |
| max_tokens | integer | No | 1024 | Maximum number of tokens to generate |
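For example, to sample less randomly and cap completion length, the sampling parameters might be set roughly as follows. This is a sketch: it assumes SamplingParamsConfig (from the same SDK as GRPOConfig) accepts the table's fields as keyword arguments.

```python
# Sketch only: constructor style assumed; field names are from the table above.
sampling_params = SamplingParamsConfig(
    temperature=0.7,   # lower than the 0.9 default for less random completions
    top_p=1.0,
    max_tokens=512,    # cap completions below the 1024-token default
)

config = GRPOConfig(
    base_model="my-base-model",      # placeholder model name
    sampling_params=sampling_params,
)
```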
The RewardFunctionsConfig class defines the configuration for reward functions that evaluate and score model outputs during the GRPO training process. These reward functions are used to guide the model's learning by providing feedback on the quality of generated completions.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| functions | object | No | - | Dictionary of reward functions |
| runtime | object | No | - | Instance of RewardFunctionsRuntimeConfig |
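The sketch below shows one way a reward function could be registered. The function's signature and the shape of the functions dictionary are assumptions for illustration; only the class and field names come from the table above.

```python
# Assumed signature: the platform's actual reward-function interface may differ.
def ends_with_period(prompt: str, completion: str, example: dict) -> float:
    """Toy reward: 1.0 if the completion ends with a period, else 0.0."""
    return 1.0 if completion.strip().endswith(".") else 0.0

# Sketch: register the function under a name in the functions dictionary.
reward_fns = RewardFunctionsConfig(functions={"ends_with_period": ends_with_period})
```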
The RewardFunctionsRuntimeConfig class defines optional configuration for the runtime environment in which reward functions execute during the GRPO training process.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| packages | array[string] | No | - | Additional packages to install for reward functions |
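If a reward function depends on libraries that are not available by default, the runtime config is where extra packages would be listed. A rough sketch, with the package names chosen purely as hypothetical examples and the constructor style assumed:

```python
# Sketch: declare extra dependencies needed by reward functions at runtime.
runtime = RewardFunctionsRuntimeConfig(
    packages=["nltk", "rouge-score"],  # hypothetical extra dependencies
)

reward_fns = RewardFunctionsConfig(
    functions={"ends_with_period": ends_with_period},  # from the earlier sketch
    runtime=runtime,
)
```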
We highly recommend reading the following resources to understand the GRPO reinforcement learning process before kicking off a GRPO job: