GRPOConfig
Configuration options for GRPO reinforcement learning
The GRPOConfig class extends SFTConfig with additional parameters for Group Relative Policy Optimization (GRPO) reinforcement learning. GRPO is used to fine-tune models based on reward functions.
General Hyperparameters
Parameter | Type | Required | Default | Description
---|---|---|---|---
base_model | string | Yes | - | The base model to fine-tune
adapter | string | No | lora | The type of adapter to use. One of lora, turbo_lora, turbo
train_steps | integer | No | - | Number of training steps (overrides epochs if set)
learning_rate | float | No | 1e-5 | Learning rate for training
enable_early_stopping | boolean | No | true | Whether to enable early stopping
lr_scheduler | object | No | linear | Learning rate scheduler configuration
optimizer | object | No | paged_adamw_8bit | Optimizer configuration
warmup_ratio | float | No | 0.03 | Ratio of training steps to use for warmup
effective_batch_size | integer | No | 16 | Effective batch size for training
apply_chat_template | boolean | No | false | Whether to apply the chat template
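As a sketch, a minimal configuration that only sets a few of the general hyperparameters might look like the following. The field names come from the table above; the import path and model name are placeholders, so adjust them to your installation.

```python
# Placeholder import path -- substitute your SDK's actual package name.
from my_sdk import GRPOConfig

# base_model is the only required field; the rest override documented defaults.
config = GRPOConfig(
    base_model="my-base-model",   # required
    adapter="lora",               # one of: lora, turbo_lora, turbo
    learning_rate=1e-5,
    warmup_ratio=0.03,
    effective_batch_size=16,
)
```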
GRPO Specific Hyperparameters
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
beta | float | No | 0.001 | Beta parameter for GRPO (value between 0 and 1) |
num_generations | integer | No | 16 | Number of generations per prompt |
sampling_params | object | No | See SamplingParamsConfig | Configuration for sampling parameters |
reward_fns | object | No | See RewardFunctionsConfig | Configuration for reward functions |
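A sketch of overriding the GRPO-specific defaults (same placeholder import path as above):

```python
from my_sdk import GRPOConfig  # placeholder import path

# Override the GRPO-specific defaults; sampling_params and reward_fns
# accept the nested configs described in the sections below.
config = GRPOConfig(
    base_model="my-base-model",
    beta=0.01,            # must lie between 0 and 1
    num_generations=8,    # completions sampled per prompt
)
```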
LoRA Specific Hyperparameters
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
rank | integer | No | 64 | Rank of the LoRA adapter |
target_modules | array[string] | No | All linear layers | List of model modules to fine-tune |
lora_alpha | integer | No | 64 | Alpha parameter for LoRA |
lora_dropout | float | No | 0.05 | Dropout rate for LoRA |
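A sketch of overriding the LoRA-specific defaults; the target module names are hypothetical and depend on the base model:

```python
from my_sdk import GRPOConfig  # placeholder import path

# A smaller adapter than the default rank-64 setup, restricted to two
# attention projection modules (hypothetical module names).
config = GRPOConfig(
    base_model="my-base-model",
    rank=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # default: all linear layers
)
```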
SamplingParamsConfig
The SamplingParamsConfig class defines the parameters used for generating completions during the GRPO training process.
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
temperature | float | No | 0.9 | Sampling temperature |
top_p | float | No | 1.0 | Top-p sampling parameter |
max_tokens | integer | No | 1024 | Maximum number of tokens to generate |
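A sketch of overriding the sampling defaults (placeholder import path):

```python
from my_sdk import SamplingParamsConfig  # placeholder import path

# Less diverse sampling and shorter completions than the documented defaults.
sampling_params = SamplingParamsConfig(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)
```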
RewardFunctionsConfig
The RewardFunctionsConfig class defines the configuration for reward functions that evaluate and score model outputs during the GRPO training process. These reward functions guide the model's learning by providing feedback on the quality of generated completions.
Parameter | Type | Required | Default | Description
---|---|---|---|---
functions | object | No | - | Dictionary of reward functions
runtime | object | No | - | Instance of RewardFunctionsRuntimeConfig
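The sketch below shows one way the functions dictionary might be populated. The reward function signature is illustrative only, since the table above does not specify the interface; consult the reward functions documentation for the exact signature your version expects.

```python
from my_sdk import RewardFunctionsConfig  # placeholder import path

# Illustrative reward function -- the exact signature the trainer passes to
# reward functions is not specified in the table above, so treat this as a sketch.
def answer_tag_reward(prompt: str, completion: str) -> float:
    """Reward completions that wrap their final answer in <answer> tags."""
    return 1.0 if "<answer>" in completion and "</answer>" in completion else 0.0

reward_fns = RewardFunctionsConfig(
    functions={"answer_format": answer_tag_reward},  # dictionary of reward functions
)
```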
RewardFunctionsRuntimeConfig
The RewardFunctionsRuntimeConfig class defines optional configuration for the runtime environment in which reward functions execute during the GRPO training process.
Parameter | Type | Required | Default | Description
---|---|---|---|---
packages | array[string] | No | - | Additional packages to install for reward functions
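If a reward function needs libraries that are not available by default, they can be requested through the runtime configuration, as in this sketch (placeholder import path):

```python
from my_sdk import RewardFunctionsRuntimeConfig  # placeholder import path

# Request extra packages for the environment in which reward functions run;
# pass the result to RewardFunctionsConfig via its runtime field.
runtime = RewardFunctionsRuntimeConfig(
    packages=["numpy", "nltk"],
)
```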
Example Usage
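The following end-to-end sketch ties the pieces above together. The import path, model name, and reward function signature are placeholders rather than part of the reference above, so adjust them to your SDK version and use case.

```python
# Placeholder import path -- substitute your SDK's actual package name.
from my_sdk import (
    GRPOConfig,
    SamplingParamsConfig,
    RewardFunctionsConfig,
    RewardFunctionsRuntimeConfig,
)

# Illustrative reward function; the signature the trainer expects may differ.
def answer_tag_reward(prompt: str, completion: str) -> float:
    """Reward completions that wrap their final answer in <answer> tags."""
    return 1.0 if "<answer>" in completion and "</answer>" in completion else 0.0

config = GRPOConfig(
    # General hyperparameters
    base_model="my-base-model",        # required
    adapter="lora",
    learning_rate=1e-5,
    effective_batch_size=16,
    apply_chat_template=True,
    # GRPO-specific hyperparameters
    beta=0.001,
    num_generations=16,
    sampling_params=SamplingParamsConfig(
        temperature=0.9,
        max_tokens=1024,
    ),
    reward_fns=RewardFunctionsConfig(
        functions={"answer_format": answer_tag_reward},
        runtime=RewardFunctionsRuntimeConfig(packages=[]),
    ),
    # LoRA-specific hyperparameters
    rank=64,
    lora_alpha=64,
    lora_dropout=0.05,
)
```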
Related Resources
We highly recommend reading the following resources to understand the GRPO reinforcement learning process before kicking off a GRPO job: