Reinforcement fine-tuning represents a powerful approach to optimizing language models without requiring extensive labeled data. This guide explores how reinforcement learning, particularly Group Relative Policy Optimization (GRPO), can enhance model performance.

Why Reinforcement Fine-tuning?

Traditional supervised fine-tuning methods have limitations:

  • Require high-quality labeled data
  • May not directly optimize for desired behaviors
  • Can be expensive and time-consuming
  • Limited ability to handle complex objectives

Reinforcement learning addresses these challenges by:

  • Using programmable reward functions
  • Optimizing directly for desired outcomes
  • Learning from trial and error
  • Developing generalized strategies

How GRPO Works

GRPO follows an iterative process to improve model performance:

1. Dual Model Architecture

The approach uses two models:

  • Frozen Reference Model: Maintains baseline performance
  • Trainable Policy Model: Optimized through training

This dual setup provides several benefits (a conceptual sketch of how the reference model constrains updates follows the list):

  • Prevents catastrophic forgetting
  • Maintains model stability
  • Enables controlled optimization
  • Preserves general capabilities
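
To make the role of the reference model concrete, here is a rough sketch (not the platform's internal implementation) of how a KL-style penalty can be computed from per-token log-probabilities of the two models; the beta parameter introduced later in this guide scales this term.

import math

def kl_penalty(policy_logprobs: list[float], ref_logprobs: list[float], beta: float) -> float:
    """Penalty that grows as the policy's token distribution drifts away from
    the frozen reference model (one common non-negative KL estimator)."""
    terms = []
    for lp, ref_lp in zip(policy_logprobs, ref_logprobs):
        diff = ref_lp - lp  # log(reference_prob / policy_prob) for this token
        terms.append(math.exp(diff) - diff - 1.0)  # >= 0, exactly 0 when the models agree
    return beta * sum(terms) / max(len(terms), 1)

Subtracting a term like this from the training objective is what anchors the trainable policy to the reference model's general capabilities.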

2. Training Process

The training loop consists of:

  1. Generation Phase:

    • Start with example prompts
    • Generate multiple completions
    • Use temperature sampling
    • Create response variations
  2. Evaluation Phase:

    • Score completions with reward functions
    • Calculate relative advantages
    • Identify high/low performing outputs
  3. Optimization Phase:

    • Update the policy model weights using the advantages calculated for each completion
    • Keep the policy close to the frozen reference model via a KL penalty (scaled by beta), rather than letting it drift freely
    • Track performance metrics by monitoring the individual reward function scores and the total reward (a sketch of one full training step follows this list)
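
The sketch below ties the three phases together for a single prompt. It is purely illustrative: policy.generate, policy.update, and the reward_fns list are hypothetical stand-ins for machinery the platform provides.

def grpo_step(prompt, example, policy, reference, reward_fns, num_generations=16, beta=0.001):
    # Generation phase: sample several completions for the same prompt
    completions = [policy.generate(prompt, temperature=0.9) for _ in range(num_generations)]

    # Evaluation phase: score each completion and compute group-relative advantages
    rewards = [sum(fn(prompt, c, example) for fn in reward_fns) for c in completions]
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0  # avoid /0
    advantages = [(r - mean_r) / std_r for r in rewards]

    # Optimization phase: reinforce high-advantage completions while a KL penalty
    # (scaled by beta) keeps the policy close to the frozen reference model
    policy.update(completions, advantages, reference=reference, beta=beta)
    return mean_r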

3. Implementation Details

Key technical aspects:

  • Uses LoRA for efficient updates
  • Leverages LoRAX (multi-LoRA) infrastructure

When to Use GRPO

Ideal Use Cases

Reasoning Tasks

  • Complex problem solving
  • Multi-step reasoning
  • Strategic planning for agents
  • Multi-agent coordination
  • Logical analysis
  • Natural language to code

Structured Outputs

  • JSON generation
  • Code completion
  • Format adherence
  • Schema validation

Optimization Goals

  • Response quality
  • Output consistency
  • Task completion
  • Error reduction

Implementing GRPO

1. Reward Function Design

Key principles for effective reward function design:

  • Clear Objectives: Define specific, unambiguous goals that align with your desired model behavior
  • Measurable Outcomes: Create quantifiable metrics that can be consistently evaluated
  • Comprehensive Coverage: Ensure the reward function captures all important aspects of the task
  • Balanced Scoring: Weight different criteria appropriately to avoid overemphasizing any single aspect
  • Robust Validation: Include error handling and edge cases to prevent reward manipulation

Example reward functions:

def format_reward(prompt: str, completion: str, example: dict) -> float:
    """Validate output format"""
    try:
        # Check format requirements
        if meets_criteria(completion):
            return 1.0
        return 0.0
    except Exception:
        return 0.0

def quality_reward(prompt: str, completion: str, example: dict) -> float:
    """Assess response quality"""
    score = 0.0
    # Add quality checks
    score += check_coherence(completion)
    score += check_relevance(prompt, completion)
    return score / 2.0 # To normalize the score to 0-1 range

2. Training Configuration

Important parameters:

  • Number of training steps: Controls total iterations (typically 1000-5000)
  • Number of generations per step: Controls the number of completions generated for each prompt (typically 8 or 16)
  • Beta: Controls how strongly the policy is penalized for drifting away from the frozen reference model (the KL penalty coefficient); small values such as 0.001 are a common starting point
  • Sampling parameters:
    • temperature: Higher values increase diversity, lower values increase consistency
    • max_tokens: Maximum length of generated responses
    • top_p: Nucleus sampling threshold between 0.1 and 1.0 (applies when temperature > 0)

Example setup:

from predibase import GRPOConfig, SamplingParams, RewardFunctionsConfig, RewardFunctionsRuntimeConfig

config = GRPOConfig(
    base_model="model-name",
    train_steps=1000,
    num_generations=16,
    beta=0.001, # 0.001 is a good starting point
    sampling_params=SamplingParams(
      temperature=0.9,
      max_tokens=1024,
    ),
    reward_fns=RewardFunctionsConfig(
        functions={
            "format": format_reward,
            "quality": quality_reward
        }
    )
)

3. Monitoring and Optimization

Key metrics to track:

  • Reward Function Trends: Monitor how individual reward functions evolve over time to identify learning patterns and potential issues
  • Generation Diversity: Track the variety of responses being generated to ensure the model isn’t getting stuck in local optima
  • Average Reward: Watch the mean reward across generations to gauge overall progress
  • Reward Variance: Monitor the spread of rewards to understand model stability
  • Convergence Rate: Track how quickly the model reaches stable performance
  • Failure Modes: Document common types of incorrect or low-quality outputs by monitoring the Reward Logs and Completions tabs

Optimization strategies:

  • Fine-tune hyperparameters (e.g. beta, temperature, max_tokens)
  • Monitor convergence (e.g. reward function trends, total reward)

Best Practices

1. Reward Function Design

  • Start Simple: Begin with basic reward functions and incrementally add complexity as needed
  • Comprehensive Testing: Test reward functions across diverse inputs and edge cases to ensure robust behavior
  • Error Handling: Implement graceful error handling for unexpected inputs and edge cases
  • Clear Documentation: Document all assumptions, requirements, and expected behaviors
  • Graded Rewards: Use continuous scoring (0-1 range) instead of binary rewards to provide richer learning signals
  • Multi-Objective Rewards: Combine format, correctness, and other relevant metrics for comprehensive feedback
  • Validation: Include validation checks to ensure reward functions produce expected outputs
  • Modularity: Design reward functions to be modular, with one reward function per objective (a short graded example follows this list)
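
To make a few of these principles concrete, here is a hypothetical graded, single-objective reward for JSON outputs. It returns continuous scores in the 0-1 range and handles malformed output gracefully; the required keys are illustrative assumptions.

import json

def json_structure_reward(prompt: str, completion: str, example: dict) -> float:
    """Graded (non-binary) reward for a single objective: JSON structure."""
    try:
        parsed = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # not valid JSON at all
    if not isinstance(parsed, dict):
        return 0.25  # parses, but is not a JSON object
    score = 0.5  # partial credit: valid JSON object
    required = {"answer", "explanation"}  # hypothetical required keys
    score += 0.5 * len(required & set(parsed)) / len(required)  # remaining credit for schema coverage
    return min(score, 1.0)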

2. Training Process

  • Start Small: Begin with a small number of training steps and gradually increase as needed
  • Monitor Closely: Keep a close eye on reward function trends, generation diversity, and average reward to identify potential issues
  • Wait for 40-50 Steps: Expect 40-50 training steps to pass before clear improvements appear
  • Validate Frequently: Regularly validate reward functions to ensure they are working as expected
  • Watch for Reward Hacking: Check the Reward Logs and Completions tabs for completions that game the reward functions

3. Reward Function Tips

Data Types in Example Dictionary

The example dictionary passed to reward functions contains string values for all columns. Convert them as needed:

  • Lists/dictionaries: Use ast.literal_eval()
  • Numbers: Use float() or int()
  • Booleans: Compare the string directly (e.g. value == "True") or use ast.literal_eval(); bool() returns True for any non-empty string

Partial Credit Scoring

Adding partial credit for partially correct responses helps models learn more effectively by providing directional feedback on how close the model is to the correct solution:

  • Binary rewards (0 or 1) provide limited learning signal
  • Partial credit creates smoother learning gradients
  • Example: 0.5 points for a correct approach but a wrong final answer (sketched in code after this list)
  • Helps identify incremental improvements
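
A minimal sketch of partial credit for a math-style task, assuming a hypothetical answer column in the dataset and a deliberately crude heuristic for a "correct approach":

def math_partial_credit(prompt: str, completion: str, example: dict) -> float:
    """Full marks for the right answer, half credit for visible working, nothing otherwise."""
    expected = example.get("answer", "").strip()  # 'answer' is a hypothetical dataset column
    if expected and expected in completion:
        return 1.0
    # crude proxy for a correct approach: the completion shows intermediate steps
    if any(marker in completion for marker in ("=", "step", "therefore")):
        return 0.5
    return 0.0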

External Libraries

You can use external packages for more sophisticated reward functions:

  • LLM-as-judge using OpenAI client
  • NLP libraries like spaCy
  • Scientific computing with NumPy
  • External APIs for validation

These can be provided in the runtime section of the RewardFunctionsConfig object.
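
For example, a hedged sketch of an LLM-as-judge reward using the OpenAI client might look like the following; the judge model, scoring prompt, and 0-10 scale are all assumptions, and the import stays inside the function (a requirement covered in the FAQ below).

def judge_reward(prompt: str, completion: str, example: dict) -> float:
    """LLM-as-judge reward: ask an external model to rate the completion."""
    from openai import OpenAI  # imports must live inside the reward function
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical judge model
            messages=[{
                "role": "user",
                "content": (
                    "Rate the following response from 0 to 10 for helpfulness. "
                    "Reply with only a number.\n\n"
                    f"Prompt: {prompt}\n\nResponse: {completion}"
                ),
            }],
        )
        score = float(resp.choices[0].message.content.strip())
        return min(max(score / 10.0, 0.0), 1.0)  # normalize to the 0-1 range
    except Exception:
        return 0.0  # sensible default if the API call or parsing fails

The openai package itself would then be listed under packages in the RewardFunctionsRuntimeConfig.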

4. Monitoring and Debugging

Understanding Reward Graphs

Key metrics to monitor:

  • total_reward: Average reward across all reward functions, indicating overall model performance
  • total_reward_std: Standard deviation of rewards, measuring performance stability and consistency
  • Individual reward function trends: Track progress of each reward component separately
  • Version-specific performance: Compare metrics across different model versions

Look for these positive indicators:

  • Steady upward trends in reward scores, showing continuous improvement
  • Decreasing variance in rewards over time, indicating more stable performance
  • Format rewards improving before task rewards, suggesting proper learning progression
  • Consistent progress across model versions, demonstrating reliable training

Using Reward Logs

The Reward Logs tab helps:

  • Monitor training metrics and reward trends over time
  • Debug reward function behavior using detailed logging and print statements
  • Track and troubleshoot API response times and error rates
  • Analyze periods of stagnant reward scores to identify potential issues

Analyzing Completions

The Completions tab provides powerful analysis capabilities:

  • Side-by-side comparison of model outputs across training iterations
  • Detailed reward score breakdowns for each completion
  • Visual progress tracking of model improvements over time
  • Detection of potential reward hacking attempts

Common Challenges

  1. Training Issues:

    • Unstable or flat rewards: Monitor reward trends and adjust learning rate
    • Slow convergence: Consider increasing batch size or simplifying reward criteria
    • Quality regression: Check for reward hacking and adjust reward function weights
    • Reward hacking: Implement reward shaping and validation checks
  2. Implementation Problems:

    • Complex reward calculations: Break down into smaller, testable components
    • Integration issues: Use proper error handling and logging
    • API timeouts: Implement retry logic and fallback mechanisms
  3. Reward Function Issues (a defensive wrapper is sketched after this list):

    • Non-numeric returns: Ensure all reward functions return float values
    • Inconsistent scoring: Standardize scoring ranges and normalize outputs
    • Missing error handling: Add try-catch blocks and default values
    • Slow computation: Optimize calculations and cache results when possible
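
Several of these reward function issues can be mitigated with a small defensive wrapper like the hypothetical one sketched below, which coerces return values to float, supplies a default score on errors, and caches repeated computations.

from functools import lru_cache

def safe_reward(reward_fn, default: float = 0.0):
    """Wrap a reward function so it always returns a float and never raises."""
    @lru_cache(maxsize=4096)  # cache repeated (prompt, completion, example) triples
    def cached(prompt: str, completion: str, example_items: tuple) -> float:
        try:
            return float(reward_fn(prompt, completion, dict(example_items)))  # guard non-numeric returns
        except Exception as exc:
            print(f"reward error: {exc}")  # surfaces in the Reward Logs tab
            return default

    def wrapper(prompt: str, completion: str, example: dict) -> float:
        return cached(prompt, completion, tuple(sorted(example.items())))

    return wrapper

For instance, the earlier quality_reward could be wrapped as safe_reward(quality_reward) before being registered.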

Frequently Asked Questions

How many training steps should I use?

We recommend starting with at least 200 steps. Most models show clear improvements after 40-50 steps, but the optimal number depends on:

  • Dataset size and complexity
  • Number of reward functions
  • Base model size
  • Task difficulty

What if my rewards are not improving?

If rewards aren’t improving after 50 or 100 steps:

  1. Check reward function implementation
  2. Verify training data quality
  3. Consider simplifying reward criteria

Can I use external APIs in reward functions?

Yes! Here are some common considerations to keep in mind:

  • API costs and rate limits (e.g., OpenAI, Anthropic, etc.):

    • Monitor usage to stay within your budget
  • Robust error handling

    • Implement exponential backoff for retries
    • Add comprehensive logging for debugging
    • Handle various failure modes gracefully
  • Fallback mechanisms

    • Maintain local alternatives for critical functions
    • Define sensible default scores for failures (see the sketch after this list)
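
A hedged sketch of these patterns (exponential backoff plus a default score when every attempt fails); call_external_api is a hypothetical stand-in for whatever service you call.

import time

def api_reward(prompt: str, completion: str, example: dict) -> float:
    """Score via an external API with retries; fall back to a default on failure."""
    max_retries, default = 3, 0.0
    for attempt in range(max_retries):
        try:
            return float(call_external_api(prompt, completion))  # hypothetical helper
        except Exception as exc:
            print(f"API attempt {attempt + 1} failed: {exc}")  # visible in the Reward Logs tab
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    return default  # sensible default score when the API is unavailable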

What score range should my reward functions return?

We recommend using scores between 0 and 1:

  • 0: Completely incorrect
  • 0.1-0.9: Partial credit
  • 1.0: Perfect response

This range helps:

  • Provide clear learning signals
  • Enable reward combination

Can I update reward functions during training?

Yes, you can modify reward functions while training:

  1. Get current config using pb.adapters.get_config()
  2. Update the reward function
  3. Apply changes with pb.adapters.update_config()
# Get current config
cfg = pb.adapters.get_config("myrepo/1")

# Update existing or add new functions
cfg.reward_fns["my_reward"] = my_reward_function_v2

# Apply updates
pb.adapters.update_config("myrepo/1", cfg)

Any updates to reward functions will be applied to the next training step. This is particularly useful when the model starts to converge and you want to provide more nuanced feedback to the model.

Note that significant changes to reward functions may impact training stability.

What if my model is “reward hacking”?

If the model finds unintended ways to maximize rewards (reward hacking):

  1. Add validation checks for input/output quality
  2. Implement penalties for suspicious patterns (a simple guard is sketched after this list)
  3. Update reward functions to discourage exploitation
  4. Use multiple reward criteria to prevent single-metric optimization
  5. Review edge cases and boundary conditions
  6. Use partial credit scoring for better learning signals
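
For instance, a hypothetical guard that zeroes out suspiciously short or highly repetitive completions before any task-specific scoring runs:

def anti_gaming_reward(prompt: str, completion: str, example: dict) -> float:
    """Penalize degenerate outputs that commonly accompany reward hacking."""
    words = completion.split()
    if len(words) < 5:
        return 0.0  # too short to be a genuine answer
    if len(set(words)) / len(words) < 0.3:
        return 0.0  # mostly repeated tokens
    return score_task(prompt, completion, example)  # hypothetical task-specific scoring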

Do the rewards need to be binary?

No! You can write more intricate reward functions that return a score between 0 and 1, such as similarity scores or partial credit scoring. Section 4.1 of our blog post walks through an example.

In fact, adding partial credit for partially correct responses can help models learn tasks more effectively. While binary rewards (0 or 1) are simpler to implement initially, they provide limited signal for the model to learn from. If you notice your model is learning slowly with binary rewards, consider updating your reward functions to incorporate partial credit. This gives the model directional feedback on how close it is to the correct solution, creating a smoother learning gradient.

For example, in a math problem, you might give 0.5 points for getting the approach right even if the final answer is wrong, or in a code generation task, award points for correct syntax even if the algorithm is not optimal.

What if I need to use an external library in my reward function?

You can import any libraries in your reward function. However, please note that the import must be inside the function definition. If external packages need to be installed for your reward functions to work, you can specify them as follows:

from predibase import RewardFunctionsConfig, RewardFunctionsRuntimeConfig

def my_reward_function(prompt, completion, example) -> float:
    import my_pkg
    return my_pkg.score(...)

cfg = RewardFunctionsConfig(
    runtime=RewardFunctionsRuntimeConfig(
        packages=[
            "my_pkg", # Add any packages you need here
        ]
    ),
    functions={
        "my_reward": my_reward_function,
    },
)

This may be useful for reward functions that involve using an LLM-as-a-judge and are called via the OpenAI client library. Other examples include using specialized NLP libraries like spaCy for text analysis, scientific computing packages like NumPy for mathematical scoring, or external APIs for checking code execution results.

How do I pass additional columns to my reward functions?

You can add additional columns to your dataset beyond the prompt column. These columns will be passed to your reward functions in the example argument as a dictionary with the key being the column name and the value being the column value.

Please note that all values in the example dictionary are strings, so you will need to convert them to the appropriate type depending on your use case. See more about data types in the next section.

What data types are supported in the example dictionary?

The example dictionary passed to reward functions contains string values for all columns, regardless of their original data type. This means you’ll need to convert non-string data types back to their original form. For example:

  • Lists, dictionaries, sets and other data structures can be converted using ast.literal_eval()
  • Numbers can be converted using float() or int()
  • Booleans are best handled by comparing the string directly (e.g. value == "True") or with ast.literal_eval(), since bool() returns True for any non-empty string (a short conversion sketch follows this list)
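
A minimal conversion sketch, assuming hypothetical columns named expected_tags, max_score, and is_valid:

import ast

def parse_example(example: dict) -> dict:
    """Convert stringified dataset columns back to useful Python types."""
    return {
        "expected_tags": ast.literal_eval(example["expected_tags"]),  # e.g. "['a', 'b']" -> ['a', 'b']
        "max_score": float(example["max_score"]),                     # e.g. "1.5" -> 1.5
        "is_valid": example["is_valid"].strip().lower() == "true",    # safer than bool() on a string
    }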

How do I interpret my reward graphs?

The Reward Graphs tab shows several charts that help you monitor the training progress:

  1. total_reward: Shows the average total reward across all reward functions combined. An upward trend indicates the model is learning to optimize for the defined rewards.
  2. total_reward_std: Shows the standard deviation of the total reward, indicating how consistent the model’s performance is across different examples. Over time, this should decrease as the model more consistently gets higher rewards for its completions.
  3. Individual reward function graphs: Each reward function you define has a graph showing how well the model is learning that specific objective. For example, you might see separate graphs for format_reward_func, correctness_reward_func, etc. The value graphed is the average reward for that function across all completions at the given training step.

When interpreting these graphs:

  • Look for an overall upward trend (except for the standard deviation graph, which should trend downwards). Progress may not be linear.
  • It’s normal to need 40-50 training steps before seeing clear signs of improvement.
  • Format-related rewards (like proper JSON structure) often improve first, before more complex task-specific rewards.
  • High variance (spiky graphs) is common early in training but should generally decrease over time.
  • If a specific reward function’s graph remains flat or decreases, you may need to adjust its implementation by updating the reward function. It’s a good idea to check the Reward Logs tab first to see if there is an obvious issue with it before working towards improving it.

How do I use the reward logs tab?

The Reward Logs tab displays any print statements from your reward functions during training. This is a valuable debugging and monitoring tool that helps you:

  1. Track training progress by showing which reward functions are executing, along with step and epoch numbers.
  2. See the reward score assigned to each completion by printing it before returning it.
  3. Debug issues in your reward function implementations by inspecting the logged output.
  4. Monitor performance issues like API timeouts or errors.
  5. Investigate why reward scores might be flatlining or behaving unexpectedly.

How do I use the completions tab?

The Completions tab provides a detailed view into what your model is actually generating during training. Here’s how to use it:

  1. Select a Prompt: Use the prompt selection table to choose any prompt from your training set. The table displays:

    • Prompt Index: A unique identifier for each prompt.
    • Prompt Text: The actual text of the prompt shown to the model.
    • Prompt Length: Number of tokens in the prompt.
    • Additional Columns: Any extra metadata or context provided with the prompt.
    • Epoch: The training epochs where this prompt was used.
    • Total Completions: Number of completions generated for this prompt.
  2. Compare Completions: This section allows you to compare two completions side-by-side.

    • By default the right column shows the best completion generated at the most recent epoch. The left column allows you to select any other completion from the same prompt (from any epoch).
    • A slider allows you to pick the epoch for the left column, and arrows allow you to switch between completions at that epoch.
    • Each column displays the completion text and all individual reward scores (and the total reward).
    • You can also easily compare any two completions (not just against the best one from the current epoch).
  3. View all completions: Shows information about every completion generated for the selected prompt.

    • Epoch: The training epoch when the completion was generated.
    • Completion: The full text output from the model.
    • Individual reward scores: Separate columns for each reward function (e.g. format_reward_func_v1).
    • Total: The combined reward score across all functions.
    • Length: Number of tokens in the completion.
    • View: Allows you to display this completion in the “Compare Completions” section.

The Completions tab is particularly valuable for:

  • Verifying that higher reward scores correspond to better quality outputs.
  • Detecting reward hacking (where the model games the reward functions).
  • Understanding how responses evolve throughout training.
  • Identifying areas needing improved reward functions.
  • Detecting breakthrough moments in the training process.
  • Validating proper output formatting and structure.

How do I control the length of my GRPO training run?

By default, GRPO training runs for 1000 steps. You can modify this setting using the train_steps parameter in the GRPOConfig. In most cases, we do not recommend lowering this value below 500, since GRPO runs take some time to converge.

Any additional tips?

At a minimum, it is usually good to have a format reward function, so the model learns to respond in the format you want, and a correctness reward function, to check whether the answer or expected output is good.

Next Steps