Configuration options for GRPO reinforcement learning
The GRPOConfig class extends SFTConfig with additional parameters for Group Relative Policy Optimization (GRPO) reinforcement learning. GRPO is used to fine-tune models based on reward functions.
Parameters inherited from SFTConfig:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| base_model | string | Yes | - | The base model to fine-tune |
| adapter | string | No | lora | The type of adapter to use. One of lora, turbo_lora, turbo |
| train_steps | integer | No | - | Number of training steps (overrides epochs if set) |
| learning_rate | float | No | 1e-5 | Learning rate for training |
| enable_early_stopping | boolean | No | true | Whether to enable early stopping |
| lr_scheduler | object | No | linear | Learning rate scheduler configuration |
| optimizer | object | No | paged_adamw_8bit | Optimizer configuration |
| warmup_ratio | float | No | 0.03 | Ratio of training steps to use for warmup |
| effective_batch_size | integer | No | 16 | Effective batch size for training |
| apply_chat_template | boolean | No | false | Whether to apply the chat template |
Parameters specific to GRPO:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| beta | float | No | 0.001 | Beta parameter for GRPO (value between 0 and 1) |
| num_generations | integer | No | 16 | Number of generations per prompt |
| sampling_params | object | No | See SamplingParamsConfig | Configuration for sampling parameters |
| reward_fns | object | No | See RewardFunctionsConfig | Configuration for reward functions |
LoRA adapter parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| rank | integer | No | 64 | Rank of the LoRA adapter |
| target_modules | array[string] | No | All linear layers | List of model modules to fine-tune |
| lora_alpha | integer | No | 64 | Alpha parameter for LoRA |
| lora_dropout | float | No | 0.05 | Dropout rate for LoRA |
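As an illustrative sketch only: the field names below come from the tables above, but the import path and the assumption that GRPOConfig accepts them as plain keyword arguments are not confirmed by this page.

```python
# Hypothetical import path; the real module layout depends on your SDK version.
from my_sdk import GRPOConfig  # assumption: fields are passed as keyword arguments

# Minimal sketch: a few SFTConfig-inherited fields plus the GRPO-specific knobs.
config = GRPOConfig(
    base_model="my-base-model",   # required; placeholder model name
    adapter="lora",               # one of lora, turbo_lora, turbo
    learning_rate=1e-5,
    effective_batch_size=16,
    beta=0.001,                   # GRPO beta, a value between 0 and 1
    num_generations=16,           # completions sampled per prompt
    rank=32,                      # LoRA rank, overriding the default of 64
)
```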
The SamplingParamsConfig class defines the parameters used for generating completions during the GRPO training process.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| temperature | float | No | 0.9 | Sampling temperature |
| top_p | float | No | 1.0 | Top-p sampling parameter |
| max_tokens | integer | No | 1024 | Maximum number of tokens to generate |
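For example, to sample less randomly and cap completion length, the sampling parameters might be set roughly as follows. This is a sketch: it assumes SamplingParamsConfig (from the same SDK as GRPOConfig) accepts the table's fields as keyword arguments.

```python
# Sketch only: constructor style assumed; field names are from the table above.
sampling_params = SamplingParamsConfig(
    temperature=0.7,   # lower than the 0.9 default for less random completions
    top_p=1.0,
    max_tokens=512,    # cap completions below the 1024-token default
)

config = GRPOConfig(
    base_model="my-base-model",      # placeholder model name
    sampling_params=sampling_params,
)
```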
The RewardFunctionsConfig class defines the configuration for reward functions that evaluate and score model outputs during the GRPO training process. These reward functions are used to guide the model's learning by providing feedback on the quality of generated completions.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| functions | object | No | - | Dictionary of reward functions |
| runtime | object | No | - | Instance of RewardFunctionsRuntimeConfig |
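The sketch below shows one way a reward function could be registered. The function's signature and the shape of the functions dictionary are assumptions for illustration; only the class and field names come from the table above.

```python
# Assumed signature: the platform's actual reward-function interface may differ.
def ends_with_period(prompt: str, completion: str, example: dict) -> float:
    """Toy reward: 1.0 if the completion ends with a period, else 0.0."""
    return 1.0 if completion.strip().endswith(".") else 0.0

# Sketch: register the function under a name in the functions dictionary.
reward_fns = RewardFunctionsConfig(functions={"ends_with_period": ends_with_period})
```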
The RewardFunctionsRuntimeConfig class defines optional configuration for the runtime environment in which reward functions execute during the GRPO training process.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| packages | array[string] | No | - | Additional packages to install for reward functions |
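If a reward function depends on libraries that are not available by default, the runtime config is where extra packages would be listed. A rough sketch, with the package names chosen purely as hypothetical examples and the constructor style assumed:

```python
# Sketch: declare extra dependencies needed by reward functions at runtime.
runtime = RewardFunctionsRuntimeConfig(
    packages=["nltk", "rouge-score"],  # hypothetical extra dependencies
)

reward_fns = RewardFunctionsConfig(
    functions={"ends_with_period": ends_with_period},  # from the earlier sketch
    runtime=runtime,
)
```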
We highly recommend reading the following resources to understand the GRPO reinforcement learning process before kicking off a GRPO job: