Reinforcement Fine-Tuning
Fine-tune models using Group Relative Policy Optimization (GRPO)
Predibase supports reinforcement learning through Group Relative Policy Optimization (GRPO), which enables direct optimization of model behavior using programmable reward functions. Unlike approaches such as PPO and DPO that rely on labeled preference data, GRPO allows models to develop generalized strategies for solving tasks without extensive human feedback collection.
Reinforcement fine-tuning is only supported in Developer and Enterprise Tiers. Users in the Free Tier will need to upgrade or book a demo with our team.
For more detailed information about GRPO and best practices, we highly recommend checking out our comprehensive user guide.
Quick Start
To get started with reinforcement fine-tuning:
- Prepare a dataset with a `prompt` field
- Define your reward functions
- Start training with GRPO
For a deeper understanding of how GRPO works and when to use it, see our reinforcement learning guide.
Dataset Requirements
Your dataset must include a `prompt` field containing the input text. Optionally, you can include additional columns that can be accessed within reward functions.
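As a sketch, a dataset for the countdown task used as the running example later in this guide could look like the following. The extra `nums` and `target` column names are hypothetical choices for this example (not required names), and the upload step to Predibase is not shown here:

```python
import pandas as pd

# Hypothetical dataset: a required "prompt" column plus optional extra
# columns ("nums", "target") that reward functions can read via `example`.
rows = [
    {
        "prompt": "Using the numbers [4, 6, 25], create an equation that equals 106.",
        "nums": [4, 6, 25],
        "target": 106,
    },
]
df = pd.DataFrame(rows)
df.to_json("countdown.jsonl", orient="records", lines=True)
```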
Define Reward Functions
What are reward functions?
Reward functions are Python functions that evaluate the quality of a model’s output by assigning a numerical score. They serve as the training signal that guides the model to improve its responses. Each reward function can focus on different aspects of the output, such as following a specific format, maintaining factual accuracy, or meeting task-specific requirements.
All reward functions must follow this Python function signature:
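A minimal sketch of that shape, using the parameter names described below:

```python
def my_reward_function(prompt: str, completion: str, example: dict) -> float:
    """Score a single model completion; higher is better."""
    # Inspect `completion` (and any dataset fields in `example`) and
    # return a numeric score, ideally in the 0-1 range.
    return 1.0
```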
Parameters:
- `prompt`: Input prompt from the dataset
- `completion`: Output generated by the model during training
- `example`: Dictionary containing all fields from the dataset row, including any additional columns that can be used in reward calculations
Return value:
- Numeric score (can be int or float, 0-1 range recommended)
Non-numeric return values default to a reward of 0. Always return valid numeric values to properly guide training.
Example Reward Functions
Here are two example reward functions that demonstrate common GRPO patterns:
- A format reward function that ensures the model’s output follows a specific structure
- A task-specific reward function that grades how well the model completes your target task. For example, if you are training a model to solve math word problems, a task-specific reward function can grade the accuracy of the model's solution
Let's say you are training a model to play countdown, a game where the model must create an equation using given numbers to reach a target number, using only basic math operations (+, -, *, /).
Example input prompt for this task:
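A hypothetical prompt might look like the following (the exact wording and the `<think>`/`<answer>` tag convention are illustrative assumptions, not a prescribed template):

```text
Using the numbers [4, 6, 25], create an equation that equals 106.
You may use +, -, *, and /, and each number exactly once.
Show your reasoning inside <think> </think> tags and return the final
equation inside <answer> </answer> tags, e.g. <answer>(1 + 2) / 3</answer>.
```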
For this task, you can define a task-specific reward function that grades the accuracy of the model's solution.
Format Reward Function
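A sketch of a format reward, assuming completions are expected to wrap reasoning in `<think>` tags and the final equation in `<answer>` tags (that tag convention carries over from the example prompt above and is an assumption, not a requirement):

```python
import re


def format_reward(prompt: str, completion: str, example: dict) -> float:
    """Return 1.0 if the completion follows <think>...</think><answer>...</answer>, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0
```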
Equation Reward Function
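A sketch of an equation reward for countdown, assuming the dataset provides `nums` and `target` columns (hypothetical names from the dataset sketch above) and the equation appears inside `<answer>` tags:

```python
import re


def equation_reward(prompt: str, completion: str, example: dict) -> float:
    """Return 1.0 if the equation uses each given number exactly once, uses only
    basic operators, and evaluates to the target number; otherwise 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not match:
        return 0.0
    equation = match.group(1).strip()

    # Allow only digits, basic operators, parentheses, and whitespace.
    if not re.fullmatch(r"[\d+\-*/().\s]+", equation):
        return 0.0

    # Each provided number must be used exactly once.
    if sorted(int(n) for n in re.findall(r"\d+", equation)) != sorted(example["nums"]):
        return 0.0

    try:
        result = eval(equation)  # limited to arithmetic by the character check above
    except Exception:
        return 0.0
    return 1.0 if abs(result - example["target"]) < 1e-6 else 0.0
```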
Start Training
Use the Predibase SDK to configure and start training:
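A rough sketch of what this can look like. The config layout, field names, and model/dataset identifiers below are illustrative assumptions, not the authoritative SDK API, so consult the SDK reference for the exact classes and arguments:

```python
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# NOTE: the config fields and identifiers below are illustrative assumptions.
adapter = pb.adapters.create(
    config={
        "base_model": "qwen2-5-7b-instruct",  # assumed model slug
        "reward_fns": {
            "format": format_reward,          # defined above
            "answer": equation_reward,
        },
    },
    dataset="countdown",                      # assumed dataset name
    repo="countdown-grpo",
)
```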
Monitor Training
You can find more detailed information in our comprehensive user guide.
Reward Graphs
Monitor training progress through:
- `total_reward`: Average combined reward across all reward functions
- `total_reward_std`: Standard deviation indicating performance consistency
- Individual reward function graphs showing learning progress for each objective in the Predibase UI
When interpreting these graphs:
- Look for an overall upward trend in rewards
- Expect 40-50 training steps before clear improvements
- Format-related rewards often improve before complex task-specific rewards
- High variance early in training should decrease over time
Reward Logs
Use the Reward Logs tab to:
- Track training progress
- Debug reward function issues
- Monitor for performance problems
- Investigate reward score behavior
Completions Tab
Compare model outputs during training:
- View completions side-by-side
- See reward scores for each completion
- Track improvements across epochs
- Detect reward hacking
- Validate output formatting
Using External Libraries
Import additional packages in reward functions:
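For instance, a variant of the equation reward could lean on `sympy` for expression parsing. Placing the imports inside the function body is a common pattern for code that runs remotely; whether Predibase requires this placement, or how extra pip dependencies are declared, is not covered here, so treat this as an illustrative sketch:

```python
def equation_reward_sympy(prompt: str, completion: str, example: dict) -> float:
    # Imports live inside the function so they resolve wherever the reward runs.
    import re
    from sympy import SympifyError, sympify

    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not match:
        return 0.0
    try:
        result = float(sympify(match.group(1).strip()))
    except (SympifyError, TypeError, ValueError):
        return 0.0
    return 1.0 if abs(result - example["target"]) < 1e-6 else 0.0
```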
Update Reward Functions
You can also modify reward functions during training. This is helpful when you want to add new evaluation criteria or change how completions are scored, whether to fix issues or to keep learning progressing once rewards begin to saturate.
Next Steps
For more detailed information about GRPO and best practices, we highly recommend checking out our comprehensive user guide.
To see example use cases and implementations: