Guide to reinforcement learning with GRPO
temperature: Higher values increase diversity; lower values increase consistency.
max_tokens: Maximum length of generated responses.
top_p: Nucleus sampling threshold (0.1-1.0) (for temperature > 0).
beta, temperature, max_tokens)
The example
dictionary passed to reward functions contains string values for
all columns. Convert them as needed:
ast.literal_eval()
float() or int()
bool()
runtime section of the RewardFunctionsConfig object.
total_reward: Average reward across all reward functions, indicating overall model performance.
total_reward_std: Standard deviation of rewards, measuring performance stability and consistency.
pb.adapters.get_config()
pb.adapters.update_config()
prompt column. These columns are passed to your reward functions in the example
argument as a dictionary that maps each column name to its value.
Note that all values in the example dictionary are strings, so you will need to
convert them to the appropriate type for your use case. See the next section
for more about data types.
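As a minimal sketch of how a reward function might read these columns: the (prompt, completion, example) signature and the answer column below are assumptions made for illustration; this guide only confirms that an example argument carries the column values as strings.

```python
def correctness_reward_func(prompt: str, completion: str, example: dict) -> float:
    """Return 1.0 if the completion contains the expected answer, else 0.0."""
    # "answer" is a hypothetical dataset column; every value arrives as a string.
    expected = example["answer"].strip()
    return 1.0 if expected and expected in completion else 0.0


# Hypothetical dataset row, shaped the way the reward function would receive it.
row = {"prompt": "What is 2 + 2?", "answer": "4"}
score = correctness_reward_func(row["prompt"], "The answer is 4.", row)
```

Because the expected answer is compared as text here, no type conversion is needed; numeric or structured columns would need to be parsed first.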
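The example dictionary passed to reward functions contains string values for
all columns, regardless of their original data type, so you’ll need to convert
non-string values back to their original form. For example:
ast.literal_eval() for lists, dictionaries, and other Python literals
float() or int() for numbers
bool() for booleans (note that any non-empty string, including "False", is
truthy, so compare against the expected text instead)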
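The conversions above could look roughly like the following sketch. The column names are hypothetical, and the boolean case swaps in a text comparison rather than calling bool() directly, since bool("False") evaluates to True in Python.

```python
import ast


def parse_bool(value: str) -> bool:
    """Parse a boolean from text instead of relying on string truthiness."""
    return value.strip().lower() in {"true", "1", "yes"}


def convert_example(example: dict) -> dict:
    """Convert selected string columns back to Python types."""
    return {
        # Hypothetical column names, chosen only for illustration.
        "choices": ast.literal_eval(example["choices"]),  # list or dict literal
        "target": float(example["target"]),               # floating-point number
        "num_steps": int(example["num_steps"]),           # integer
        "is_valid": parse_bool(example["is_valid"]),      # boolean
    }


raw = {"choices": "['a', 'b']", "target": "0.5", "num_steps": "3", "is_valid": "False"}
converted = convert_example(raw)
```

ast.literal_eval() only evaluates Python literals, so it is a safer choice than eval() for values that come from a dataset.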
The Reward Graphs tab shows several charts that help you monitor training
progress:
format_reward_func, correctness_reward_func, etc. The value graphed is the
average reward for that function across all completions at the given training
step.
Reward Logs tab first to see if there is an obvious issue with it before
working towards improving it.
The Reward Logs tab displays any print statements from your reward functions
during training. This is a valuable debugging and monitoring tool that helps
you:
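As a minimal sketch of the kind of print statement that would produce such log output: the function signature and the <answer> tag format checked below are assumptions for illustration, not something this guide specifies.

```python
def format_reward_func(prompt: str, completion: str, example: dict) -> float:
    """Reward completions that wrap their answer in <answer> tags."""
    # The <answer>...</answer> convention is a hypothetical output format.
    has_format = "<answer>" in completion and "</answer>" in completion
    reward = 1.0 if has_format else 0.0
    # Printed lines like this one are what a reward log view would capture.
    print(f"[format_reward_func] reward={reward} completion={completion[:80]!r}")
    return reward


score = format_reward_func("Solve 2 + 2.", "<answer>4</answer>", {})
```

Prefixing each line with the function name and truncating long completions keeps the output easy to scan when several reward functions log at once.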
The Completions tab provides a detailed view into what your model is actually
generating during training. Here’s how to use it:
The Completions tab is particularly valuable for:
train_steps parameter in the GRPOConfig. In most cases, we do not recommend
lowering this value below 500, because GRPO runs take some time to converge.