Fine-tune models using Group Relative Policy Optimization (GRPO)
The dataset needs a prompt field containing the input text. Optionally, you can include additional columns that can be accessed within reward functions.
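As a rough illustration of that layout, the sketch below builds two rows with the required prompt field plus one extra column. The column name answer, the sample questions, and the choice of JSONL as the serialization are assumptions made for this example, not requirements stated here.

```python
import json

# Hypothetical rows: "prompt" is the required field; "answer" is an
# extra column invented for this illustration.
rows = [
    {"prompt": "What is 2 + 3?", "answer": "5"},
    {"prompt": "What is 7 * 6?", "answer": "42"},
]

# Serialize one JSON object per line (JSONL). The file format is an
# assumption; use whatever format your training service accepts.
jsonl = "\n".join(json.dumps(row) for row in rows)
print(jsonl)
```

Any extra column stored this way travels with the row, so a reward function can compare the model's output against it.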
prompt: Input prompt from the dataset
completion: Output generated by the model during training
example: Dictionary containing all fields from the dataset row, including any additional columns that can be used in reward calculations

total_reward: Average combined reward across all reward functions
total_reward_std: Standard deviation indicating performance consistency
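A minimal sketch of how these pieces might fit together, assuming a reward function receives prompt, completion, and example and returns a float. The function names, the answer column, and the 20-word threshold are hypothetical. The aggregation is one plausible reading of total_reward and total_reward_std (mean and population standard deviation of the per-completion summed rewards), not a confirmed description of how the training service computes them.

```python
from statistics import mean, pstdev


def exact_answer_reward(prompt: str, completion: str, example: dict) -> float:
    """Return 1.0 if the completion contains the reference answer, else 0.0."""
    # "answer" is a hypothetical extra dataset column.
    answer = str(example.get("answer", "")).strip()
    return 1.0 if answer and answer in completion else 0.0


def brevity_reward(prompt: str, completion: str, example: dict) -> float:
    """Return 1.0 for completions of at most 20 words, else 0.0."""
    return 1.0 if len(completion.split()) <= 20 else 0.0


reward_functions = [exact_answer_reward, brevity_reward]


def summarize_rewards(prompt: str, completions: list, example: dict) -> dict:
    """Sum all reward functions per completion, then aggregate over the group.

    Assumption: the reported metrics are the mean and population standard
    deviation of the combined rewards; the actual formula may differ.
    """
    combined = [
        sum(fn(prompt, completion, example) for fn in reward_functions)
        for completion in completions
    ]
    return {
        "total_reward": mean(combined),
        "total_reward_std": pstdev(combined),
    }


example = {"prompt": "What is 2 + 3?", "answer": "5"}
completions = ["The answer is 5.", "I think it is 6."]
metrics = summarize_rewards(example["prompt"], completions, example)
print(metrics)
```

Under this reading, a low standard deviation means the sampled completions for a prompt score similarly, while a high one means quality varies widely within the group.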