The GRPOConfig class extends SFTConfig with additional parameters for Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune models against reward functions.

General Hyperparameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| base_model | string | Yes | - | The base model to fine-tune |
| adapter | string | No | lora | The type of adapter to use. One of lora, turbo_lora, turbo |
| train_steps | integer | No | - | Number of training steps (overrides epochs if set) |
| learning_rate | float | No | 1e-5 | Learning rate for training |
| enable_early_stopping | boolean | No | true | Whether to enable early stopping |
| lr_scheduler | object | No | linear | Learning rate scheduler configuration |
| optimizer | object | No | paged_adamw_8bit | Optimizer configuration |
| warmup_ratio | float | No | 0.03 | Ratio of training steps to use for warmup |
| effective_batch_size | integer | No | 16 | Effective batch size for training |
| apply_chat_template | boolean | No | false | Whether to apply the chat template |
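
For example, a minimal configuration that overrides a handful of these general hyperparameters might look like the following sketch (the values shown are illustrative, not recommendations; omitted parameters fall back to the defaults above):

from predibase import GRPOConfig

config = GRPOConfig(
    base_model="qwen3-8b",
    adapter="lora",
    train_steps=1000,         # illustrative; overrides epochs when set
    learning_rate=1e-5,
    effective_batch_size=16,
    apply_chat_template=True,
)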

GRPO Specific Hyperparameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| beta | float | No | 0.001 | Beta parameter for GRPO (value between 0 and 1) |
| num_generations | integer | No | 16 | Number of generations per prompt |
| sampling_params | object | No | See SamplingParamsConfig | Configuration for sampling parameters |
| reward_fns | object | No | See RewardFunctionsConfig | Configuration for reward functions |
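
The GRPO-specific parameters sit alongside the general ones on the same config. A minimal sketch (values are illustrative):

from predibase import GRPOConfig, SamplingParamsConfig

config = GRPOConfig(
    base_model="qwen3-8b",
    beta=0.001,          # regularization strength, between 0 and 1
    num_generations=16,  # completions sampled per prompt and scored relative to each other
    sampling_params=SamplingParamsConfig(temperature=0.9, max_tokens=1024),
)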

LoRA Specific Hyperparameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| rank | integer | No | 64 | Rank of the LoRA adapter |
| target_modules | array[string] | No | All linear layers | List of model modules to fine-tune |
| lora_alpha | integer | No | 64 | Alpha parameter for LoRA |
| lora_dropout | float | No | 0.05 | Dropout rate for LoRA |
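
Assuming the LoRA hyperparameters are passed directly on GRPOConfig as the table above suggests, a sketch might look like this (the target_modules names are hypothetical and model-dependent):

from predibase import GRPOConfig

config = GRPOConfig(
    base_model="qwen3-8b",
    rank=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # hypothetical module names; defaults to all linear layers when omitted
)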

SamplingParamsConfig

The SamplingParamsConfig class defines the parameters used for generating completions during the GRPO training process.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| temperature | float | No | 0.9 | Sampling temperature |
| top_p | float | No | 1.0 | Top-p sampling parameter |
| max_tokens | integer | No | 1024 | Maximum number of tokens to generate |
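
For instance, to cap generation length and sample more conservatively during training (values are illustrative):

from predibase import SamplingParamsConfig

sampling_params = SamplingParamsConfig(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)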

RewardFunctionsConfig

The RewardFunctionsConfig class defines the configuration for reward functions that evaluate and score model outputs during the GRPO training process. These reward functions are used to guide the model’s learning by providing feedback on the quality of generated completions.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| functions | object | No | - | Dictionary of reward functions |
| runtime | object | No | - | Instance of RewardFunctionsRuntimeConfig |
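
Each entry in functions maps a name to a callable; as shown in the full example below, the expected signature is (prompt: str, completion: str, example: dict) -> float. A minimal sketch with a hypothetical reward function:

from predibase import RewardFunctionsConfig

def length_reward(prompt: str, completion: str, example: dict) -> float:
    # Hypothetical reward: prefer concise completions under 200 characters.
    return 1.0 if len(completion) < 200 else 0.0

reward_fns = RewardFunctionsConfig(functions={"length": length_reward})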

RewardFunctionsRuntimeConfig

The RewardFunctionsRuntimeConfig class optionally defines the configuration for the runtime environment of reward functions during the GRPO training process.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| packages | array[string] | No | - | Additional packages to install for reward functions |
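
For example, if your reward functions import third-party libraries at scoring time, those dependencies can be declared on the runtime (package names below are illustrative):

from predibase import RewardFunctionsRuntimeConfig

runtime = RewardFunctionsRuntimeConfig(
    packages=["nltk", "rapidfuzz"],  # illustrative packages required by the reward functions
)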

Example Usage

from predibase import GRPOConfig, SamplingParamsConfig, RewardFunctionsConfig

def format_reward(prompt: str, completion: str, example: dict) -> float:
    """Validate output format"""

    try:
        if completion.startswith("```json"):
            return 1.0
        return 0.0
    except Exception:
        return 0.0

config = GRPOConfig(
    base_model="qwen3-8b",
    sampling_params=SamplingParamsConfig(
        max_tokens=512
    ),
    reward_fns=RewardFunctionsConfig(
        functions={
            "format": format_reward
        },
        # Optional: additional packages to install if your reward functions require them
        # runtime=RewardFunctionsRuntimeConfig(
        #     packages=[
        #         "mypkg",
        #     ]
        # ),
    )
)

We highly recommend reading the following resources to understand the GRPO reinforcement learning process before kicking off a GRPO job: