Reinforcement Fine-tuning (GRPO)
Overview
With the advent of reasoning models like DeepSeek R1, reinforcement learning (RL) has become more effective and important for fine-tuning models. This is because traditional post-training methods like supervised fine-tuning don't work as well for reasoning models as they do for non-reasoning models. While reasoning models are great out of the box, they can be significantly improved through reinforcement learning on downstream tasks.
Predibase now supports reinforcement learning through Group Relative Policy Optimization (GRPO), an innovative RL method introduced in DeepSeek's groundbreaking R1 paper. Unlike traditional RL approaches such as Reinforcement Learning with Human Feedback (RLHF), which require collecting labeled preference data to train reward models, GRPO enables direct optimization of model behavior using programmable reward functions. This allows models to develop generalized strategies for solving tasks without the need for any (or extensive) human feedback data collection.
How does GRPO work?
GRPO follows an iterative training process:
Dataset Input: The process begins with a dataset of prompts that will be used to train the model; often just tens of examples are enough.
Models: Inside the GRPO Trainer, two components work together:
- Frozen LLM: Acts as a reference model that maintains baseline performance
- Trainable LLM: The model being optimized through the training process
The reason there are two models is to maintain stability during training. The frozen LLM acts as an anchor point, ensuring the trainable LLM doesn't deviate too drastically from the original model's behavior while still optimizing for the target task. This dual-model approach helps prevent catastrophic forgetting and maintains the model's general capabilities while improving on the specific task at hand.
Completion Generation: For each prompt, the training model (often called the policy model in RL literature) generates N different completions through temperature-based sampling. This sampling method introduces controlled randomness into the generation process, producing variations in the completions while maintaining coherence. By generating multiple slightly different completions for the same prompt, we create a diverse set of responses that can be evaluated in the next stage.
Reward Scoring: A reward server evaluates each completion using predefined programmable reward functions, assigning scores based on the quality or correctness of the responses. These scores are summed to produce a final score for each completion in the group. The mean and standard deviation of these scores are then used to calculate "advantages", identifying which completions performed above average (positive advantage) and which performed below average (negative advantage) within the group. This split into above- and below-average completions provides clear learning signals to the model about which patterns to reinforce and which to avoid in future generations.
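As a rough sketch (illustrative only, not the exact Predibase implementation), the group-relative advantage computation for a group of completions looks like this:
import statistics

def group_advantages(reward_scores: list[dict[str, float]]) -> list[float]:
    # reward_scores: one dict per completion, e.g. {"format": 1.0, "answer": 0.0}
    totals = [sum(scores.values()) for scores in reward_scores]
    mean = statistics.mean(totals)
    std = statistics.pstdev(totals) or 1.0  # avoid division by zero when all rewards are equal
    # Completions above the group mean get positive advantages, below-average ones negative.
    return [(t - mean) / std for t in totals]

# Example: 4 completions generated for the same prompt
print(group_advantages([
    {"format": 1.0, "answer": 1.0},
    {"format": 1.0, "answer": 0.0},
    {"format": 0.0, "answer": 0.0},
    {"format": 1.0, "answer": 0.0},
]))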
Iterative Improvements: This process repeats continuously during training, allowing the model to progressively learn better strategies for generating high-quality completions. Through each iteration:
- The model refines its approach based on the reward signals, learning which patterns lead to higher rewards
- The policy model's weights are updated to favor strategies that produced better completions
- The latest version of the policy model is used to generate completions for the next batch of prompts
- The proximity to the reference model helps maintain stability while improving performance
This iterative cycle enables the model to discover and reinforce effective reasoning patterns while avoiding behaviors that lead to lower rewards. Over time, the model improves through direct optimization of task-specific metrics without requiring explicit labeled data or human feedback.
We've optimized this process by leveraging Low-Rank Adaptation (LoRA) instead of updating the full model weights. LoRA adapters allow us to efficiently fine-tune models by only training a small set of parameters while keeping the base model frozen. We use high-rank LoRA adapters by default to ensure sufficient model capacity for learning complex reasoning patterns. The training process is powered by our production-grade multi-LoRA serving infrastructure, LoRAX, which enables rapid iteration between training steps by efficiently updating and serving the policy model within the training loop.
From our experiments, we've found that GRPO is not limited to showing improvements only on reasoning models - it works remarkably well even for non-reasoning models like Qwen-2.5 and Llama-3. This makes it a versatile approach for enhancing model performance across different model architectures and capabilities.
When to use Reinforcement Fine-Tuning
The flowchart below helps determine whether to use Reinforcement Fine-Tuning (RFT), Supervised Fine-Tuning (SFT), or RLHF based on your data and task characteristics.
How To Use GRPO in Predibase
Reinforcement fine-tuning is only supported in Developer and Enterprise Tiers. Users in the Free Tier will need to upgrade in order to kick off GRPO training jobs, or book a demo with our team.
Prepare Your Dataset
For RFT, you need, at minimum, a text dataset with a prompt field. Your dataset can optionally have other columns as well. When provided, these columns are accessible within your reward functions if you need their values.
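For example, a few hypothetical rows for the countdown task used later on this page could be prepared like this (only the prompt column is required; nums and target are optional extra columns):
# Hypothetical rows for the countdown task described later on this page.
# Only "prompt" is required; "nums" and "target" are optional extra columns
# that reward functions can read from the `example` argument.
import pandas as pd

df = pd.DataFrame([
    {
        "prompt": "Using the numbers [17, 64, 63, 26], create an equation that equals 44. ...",
        "nums": "[17, 64, 63, 26]",
        "target": "44",
    },
    {
        "prompt": "Using the numbers [3, 5, 7], create an equation that equals 36. ...",
        "nums": "[3, 5, 7]",
        "target": "36",
    },
])
# Upload the dataframe as a Predibase dataset; see the Predibase dataset docs for
# the upload method available in your SDK version.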
Defining Reward Functions
After you have uploaded your dataset, you can define one or more reward functions. These reward functions will be used to score your model's generations during training.
All reward functions must follow this function signature:
def reward_fn(prompt: str, completion: str, example: dict[str, str]) -> float
The prompt is a prompt from your dataset. The completion is one of the N outputs the model generated for that prompt. The example is the original data sample from your dataset, represented as a dictionary. If you define more than one reward function, make sure each one has a unique function name.
Note: If your reward function returns None or a non-numeric value (anything other than an integer or float), the system will automatically assign a default score of 0. Make sure your reward functions always return valid numeric values to properly guide the training process.
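For instance, a minimal (purely illustrative) function matching this signature, using a hypothetical length check, always returns a float:
def length_reward_func(prompt: str, completion: str, example: dict[str, str]) -> float:
    # Purely illustrative: a hypothetical reward that prefers concise completions.
    # Always returns a float so the default score of 0 is never applied.
    return 1.0 if len(completion) < 512 else 0.0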
Example Reward Functions
Here are two example reward functions that demonstrate common patterns in GRPO reward functions:
- A format reward function that ensures the model's output follows a specific structure
- An equation reward function that validates mathematical correctness for a generated math expression
The example below is from a math problem-solving task where the model needs to:
- Generate equations using specific numbers to reach a target value
- Format its response with both reasoning (in <think> tags) and a final answer (in <answer> tags); this format is specified in the system prompt.
Example input prompt for this task:
<|im_start|>system
You are a helpful assistant. You first think about the reasoning process step by step
and then provide the user with an answer.<|im_end|>
<|im_start|>user
Using the numbers [17, 64, 63, 26], create an equation that equals 44. You can use
basic arithmetic operations (+, -, *, /) and parentheses, and each number can only
be used once. Show your work in <think> </think> tags. And return the final equation
and answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
Format Reward Function
def format_reward_func(prompt: str, completion: str, example: dict[str, str]) -> float:
    # Imported packages must be inside each reward function
    import re

    try:
        # Check if the format matches expected pattern:
        # <think> content </think> followed by <answer> content </answer>
        regex = (
            r"^<think>\s*([^<]*(?:<(?!/?think>)[^<]*)*)\s*<\/think>\n"
            r"<answer>\s*([\s\S]*?)\s*<\/answer>$"
        )
        match = re.search(regex, completion, re.DOTALL)
        if match is not None and len(match.groups()) == 2:
            return 1.0
        return 0.0
    except Exception:
        return 0.0
This reward function checks whether the model's output follows the requested format of step-by-step reasoning within <think> tags followed by the final answer within <answer> tags, returning 1.0 if the format is correct and 0.0 otherwise.
Format reward functions are important for two key reasons:
- They encourage chain-of-thought reasoning by requiring the model to explicitly show its work, which often leads to more accurate solutions.
- They enforce consistent output structures that can be reliably parsed for downstream validation and processing. In this example, the format allows us to extract the final answer from the <answer> tags to verify its correctness.
When combined with other reward functions (like the equation validator), format rewards help create a multi-objective optimization that balances both structural compliance and task-specific correctness.
Equation Reward Function
def equation_reward_func(prompt: str, completion: str, example: dict[str, str]) -> float:
    # Imported packages must be inside each reward function
    import re
    import ast

    try:
        match = re.search(r"<answer>\s*([\s\S]*?)\s*<\/answer>", completion)
        if not match:
            return 0.0

        # Extract and validate equation
        equation = match.group(1).strip()
        if not re.match(r'^[\d+\-*/().\s]+$', equation):
            return 0.0

        # Extract and validate numbers
        used_numbers = [int(n) for n in re.findall(r'\d+', equation)]
        nums = ast.literal_eval(example["nums"]) if isinstance(example["nums"], str) else example["nums"]
        if sorted(used_numbers) != sorted(nums):
            return 0.0

        # Evaluate equation and check result
        result = eval(equation, {"__builtins__": None}, {})
        if abs(float(result) - float(example["target"])) < 1e-5:
            return 1.0
        return 0.0
    except Exception:
        return 0.0
This reward function evaluates the actual correctness of the task:
- Extracts the equation from within the <answer> tags
- Validates that the equation only contains valid mathematical symbols and numbers
- Checks that the equation uses exactly the numbers provided in the input (no more, no less) since that is the required constraint in our problem
- Evaluates the equation and verifies it equals the target value
- Returns 1.0 if all checks pass, 0.0 otherwise
Start GRPO Training Job
Once you have your reward functions defined, you can pass them in through the GRPOConfig in the Predibase SDK. See the GRPO example notebook for details on creating the countdown_train dataset.
from predibase import GRPOConfig, RewardFunctionsConfig

adapter = pb.adapters.create(
    config=GRPOConfig(
        base_model="qwen2-5-7b-instruct",
        reward_fns=RewardFunctionsConfig(
            functions={
                "format": format_reward_func,
                "answer": equation_reward_func,
            },
        )
    ),
    dataset="countdown_train",
    repo=repo,
    description="..."
)
Pricing
RFT training is billed by the number of tokens in your dataset and how many epochs you train. The latest pricing can be found on our pricing page in the fine-tuning section. Note that RFT requires more compute than typical SFT fine-tuning, so the prices are higher.
Reward Function FAQs
How do I pass additional columns to my reward functions?
You can add additional columns to your dataset beyond the prompt column. These columns will be passed to your reward functions in the example argument as a dictionary, with keys being the column names and values being the column values. Please note that all values in the example dictionary are strings, so you will need to convert them to the appropriate type depending on your use case. See more here.
Do the rewards need to be binary?
Nope! You can write more intricate reward functions that return a score between 0 and 1 (such as for similarity scores or partial credit scoring). You can take a look at section 4.1 of our blog post here for an example.
In fact, adding partial credit for partially correct responses can help models learn tasks more effectively. While binary rewards (0 or 1) are simpler to implement initially, they provide limited signal for the model to learn from. If you notice your model is learning slowly with binary rewards, consider updating your reward functions to incorporate partial credit. This gives the model directional feedback on how close it is to the correct solution, creating a smoother learning gradient. For example, in a math problem, you might give 0.5 points for getting the approach right even if the final answer is wrong, or in a code generation task, award points for correct syntax even if the algorithm is not optimal.
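As an illustration, here is a hedged sketch of a partial-credit variant of the equation reward from above (the 0.5 partial score is an arbitrary choice, not a Predibase recommendation): it awards 0.5 for a well-formed equation that uses the required numbers even when the result misses the target.
def partial_credit_equation_reward_func(prompt: str, completion: str, example: dict[str, str]) -> float:
    # Illustrative sketch of partial credit: 0.5 for a valid attempt, 1.0 for a correct answer.
    import re
    import ast

    try:
        match = re.search(r"<answer>\s*([\s\S]*?)\s*<\/answer>", completion)
        if not match:
            return 0.0
        equation = match.group(1).strip()
        if not re.match(r'^[\d+\-*/().\s]+$', equation):
            return 0.0
        used_numbers = [int(n) for n in re.findall(r'\d+', equation)]
        nums = ast.literal_eval(example["nums"]) if isinstance(example["nums"], str) else example["nums"]
        if sorted(used_numbers) != sorted(nums):
            return 0.0
        result = eval(equation, {"__builtins__": None}, {})
        if abs(float(result) - float(example["target"])) < 1e-5:
            return 1.0
        # Partial credit: the equation is well formed and uses the required numbers,
        # so return 0.5 even though it misses the target value.
        return 0.5
    except Exception:
        return 0.0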
What if I need to use an external library in my reward function?
You can import any libraries in your reward function. However, please note that the import must be inside the function definition. If external packages need to be installed for your reward functions to work, you can specify them as follows:
from predibase import RewardFunctionsConfig, RewardFunctionsRuntimeConfig

def my_reward_function(prompt, completion, example) -> float:
    import my_pkg
    return my_pkg.score(...)

cfg = RewardFunctionsConfig(
    runtime=RewardFunctionsRuntimeConfig(
        packages=[
            "mypkg",
        ]
    ),
    functions={
        "my_reward": my_reward_function,
    },
)
For example, this may be useful for reward functions that use an LLM-as-a-judge called via the OpenAI client library. Other examples include using specialized NLP libraries like spaCy for text analysis, scientific computing packages like NumPy for mathematical scoring, or external APIs for checking code execution results.
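As an illustration, a hedged sketch of such an LLM-as-a-judge reward function using the OpenAI client is shown below; the model name, judging prompt, and the assumption that an OpenAI API key is available to the reward server (for example via OPENAI_API_KEY) are all placeholders rather than anything Predibase provides.
from predibase import RewardFunctionsConfig, RewardFunctionsRuntimeConfig

def judge_reward_func(prompt: str, completion: str, example: dict[str, str]) -> float:
    # Illustrative LLM-as-a-judge sketch. Assumes an OpenAI API key is available
    # in the reward server environment (e.g. OPENAI_API_KEY); model name and
    # judging prompt are placeholders.
    from openai import OpenAI

    try:
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rate the answer from 0 to 10. Reply with a single number."},
                {"role": "user", "content": f"Question:\n{prompt}\n\nAnswer:\n{completion}"},
            ],
        )
        score = float(response.choices[0].message.content.strip())
        return max(0.0, min(score / 10.0, 1.0))  # normalize to [0, 1]
    except Exception:
        return 0.0

cfg = RewardFunctionsConfig(
    runtime=RewardFunctionsRuntimeConfig(packages=["openai"]),
    functions={"judge": judge_reward_func},
)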
Data Types in Example Dictionary
The example dictionary passed to reward functions contains string values for all columns, regardless of their original data type. This means you'll need to convert non-string data types back to their original form. For example:
- Lists, dictionaries, sets, and other data structures can be converted using ast.literal_eval()
- Numbers can be converted using float() or int()
- Booleans can be recovered by comparing against the string value (for example, value == "True") or with ast.literal_eval(); note that calling bool() on any non-empty string, including "False", returns True
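For instance, a short sketch of these conversions using the nums and target columns from the countdown dataset above (the scoring check at the end is a toy example, purely illustrative):
def typed_example_reward_func(prompt: str, completion: str, example: dict[str, str]) -> float:
    # Sketch of converting the string values in `example` back to their original types.
    # "nums" and "target" are the extra columns from the countdown dataset above.
    import ast

    nums = ast.literal_eval(example["nums"])   # "[17, 64, 63, 26]" -> [17, 64, 63, 26]
    target = int(example["target"])            # "44" -> 44
    # Toy check, purely illustrative: reward completions that mention the target value.
    return 1.0 if str(target) in completion and len(nums) > 0 else 0.0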
How do I update my reward functions?
As your GRPO training run progresses, you may wish to update, add, or remove reward functions to improve the learning progress of your model. Here is an example of using the Predibase SDK to update your reward functions:
# An updated version of an existing reward function. Note that the name of the function does not matter;
# a reward function can be updated with a new function that has a different name.
def my_reward_function_v2(prompt, completion, example) -> float:
    import my_pkg
    return my_pkg.score_v2(...)

# A completely new reward function.
def my_new_reward_function(prompt, completion, example) -> float:
    return 1.0
# Get the current configuration.
cfg = pb.adapters.get_config("myrepo/1")
# Inspect the existing reward function's source code.
print(cfg.reward_fns["my_reward"].source)
# Get the actual callable function.
old_func = cfg.reward_fns["my_reward"].function
# Update an existing reward function by assigning to the existing key `my_reward`.
cfg.reward_fns["my_reward"] = my_reward_function_v2
# Add a completely new reward function by assigning to a new key.
cfg.reward_fns["new"] = my_new_reward_function
# Apply the update.
pb.adapters.update_config("myrepo/1", cfg)
The updated set of reward functions will automatically be picked up by your training job during the next training step!
Additionally, when you update an existing reward function by assigning to an existing key (like my_reward above), Predibase remembers that you've created a new version of the reward function and will display metrics and progress accordingly.
You can view the complete history and lineage of your reward functions, including all versions and implementations, in the Reward Functions tab.
How do I interpret my reward graphs?
The Reward Graphs tab shows several charts that help you monitor the training progress:
total_reward: Shows the average total reward across all reward functions combined. An upward trend indicates the model is learning to optimize for the defined rewards.
total_reward_std: Shows the standard deviation of the total reward, indicating how consistent the model's performance is across different examples. Over time, this should decrease as the model more consistently gets higher rewards for its completions.
Individual reward function graphs: Each reward function you define has a graph showing how well the model is learning that specific objective. For example, you might see separate graphs for format_reward_func, correctness_reward_func, etc. The value graphed is the average reward for that function across all completions at the given training step. All versions of a reward function are plotted on the same graph, allowing you to track performance across different versions and combinations of reward functions.
When interpreting these graphs:
- Look for an overall upward trend (except for the standard deviation graph, which should trend downwards). Progress may not be linear.
- It's normal to need 40-50 training steps before seeing clear signs of improvement.
- Format-related rewards (like proper JSON structure) often improve first, before more complex task-specific rewards.
- High variance (spiky graphs) is common early in training but should generally decrease over time.
- If a specific reward function's graph remains flat or decreases, you may need to adjust its implementation by updating the reward function. It's a good idea to check the Reward Logs tab first to see if there is an obvious issue with it before working towards improving it.
The example graphs above show typical learning curves where the model gradually improves its reward scores over training steps, with some expected fluctuation along the way.
How do I use the reward logs tab?
The Reward Logs tab displays any print statements from your reward functions during training. This is a valuable debugging and monitoring tool that helps you:
- Track training progress by showing which reward functions are executing, along with step and epoch numbers.
- See the reward score assigned to each completion by printing it before returning it.
- Debug issues in your reward function implementations by inspecting the logged output.
- Monitor performance issues like API timeouts or errors.
- Investigate why reward scores might be flatlining or behaving unexpectedly.
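For example, a simplified variant of the format reward above that prints its score (anything printed inside a reward function appears in the Reward Logs tab):
def format_reward_func_logged(prompt: str, completion: str, example: dict[str, str]) -> float:
    # Simplified sketch: same idea as format_reward_func above, with print-based logging.
    import re

    try:
        match = re.search(r"^<think>[\s\S]*<\/think>\s*<answer>[\s\S]*<\/answer>$", completion)
        score = 1.0 if match else 0.0
        # These print statements show up in the Reward Logs tab.
        print(f"[format_reward_func_logged] score={score} completion_len={len(completion)}")
        return score
    except Exception as e:
        print(f"[format_reward_func_logged] error: {e}")
        return 0.0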
How do I use the completions tab?
The Completions tab provides a detailed view into what your model is actually generating during training. Here's how to use it:
Select a Prompt: Use the prompt selection table to choose any prompt from your training set. The table displays:
- Prompt Index: A unique identifier for each prompt.
- Prompt Text: The actual text of the prompt shown to the model.
- Prompt Length: Number of tokens in the prompt.
- Additional Columns: Any extra metadata or context provided with the prompt.
- Epoch: The training epochs where this prompt was used.
- Total Completions: Number of completions generated for this prompt.
This gives you a starting point to analyze the model's outputs and understand how different prompts were used during training.
By default Predibase detects common prefixes shared by all prompts and does not display them in the prompt text. This is primarily intended for the system prompt, but may also include text from the user prompt itself (if all user prompts share a common prefix). To show the common prefix, select the "Show detected common prefix" toggle.
Compare Completions: This section allows you to compare two completions side-by-side.
- By default the right column shows the best completion generated at the most recent epoch. The left column allows you to select any other completion from the same prompt (from any epoch).
- A slider allows you to pick the epoch for the left column, and arrows allow you to switch between completions at that epoch.
- Each column displays the completion text and all individual reward scores (and the total reward).
- You can also easily compare any two completions (not just against the best one from the current epoch). See the "View all completions" section below for more details.
View all completions: Shows information about every completion generated for the selected prompt.
- Epoch: The training epoch when the completion was generated.
- Completion: The full text output from the model.
- Individual reward scores: Separate columns for each reward function (e.g. format_reward_func_v1).
- Total: The combined reward score across all functions.
- Length: Number of tokens in the completion.
- View: Allows you to display this completion in the "Compare Completions" section (see the previous section). Selecting L displays the completion in the left column, and selecting R displays it on the right. This allows you to easily compare any two completions side-by-side.
The Completions tab is particularly valuable for:
- Verifying that higher reward scores correspond to better quality outputs.
- Detecting reward hacking (where the model games the reward functions).
- Understanding how responses evolve throughout training.
- Identifying areas needing improved reward functions.
- Detecting breakthrough moments in the training process.
- Validating proper output formatting and structure.
Any additional tips?
At a minimum, it is usually good to have a format reward function, so the model learns to respond in the format you want, and a correctness reward function, so you can check whether the answer or expected output is good.