# Reward Functions

This module contains some useful reward functions, primarily intended for use with the [GRPOTrainer](/docs/trl/main/en/gspo_token#trl.GRPOTrainer) and [RLOOTrainer](/docs/trl/main/en/rloo_trainer#trl.RLOOTrainer).
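
Reward functions are passed to a trainer through its `reward_funcs` argument. Below is a minimal sketch of wiring two of the rewards on this page into [GRPOTrainer](/docs/trl/main/en/gspo_token#trl.GRPOTrainer); the model name and dataset are placeholders (the dataset is assumed to provide the `solution` column that `accuracy_reward` expects), and all other options are left at their defaults.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward, think_format_reward

# Hypothetical dataset with "prompt" and "solution" columns; swap in your own.
dataset = load_dataset("my-org/my-math-dataset", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    reward_funcs=[think_format_reward, accuracy_reward],
    args=GRPOConfig(output_dir="grpo-rewards"),
    train_dataset=dataset,
)
trainer.train()
```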

## accuracy_reward[[trl.rewards.accuracy_reward]]

#### trl.rewards.accuracy_reward[[trl.rewards.accuracy_reward]]

[Source](https://github.com/huggingface/trl/blob/main/trl/rewards/accuracy_rewards.py#L27)

Reward function that checks if the completion matches the ground truth.
- If both gold and prediction are parseable → use math verification.
- If gold is not parseable → return `None` to skip the example.

Example:
```python
>>> from trl.rewards import accuracy_reward

>>> solutions = [r"\frac{1}{3}", r"\frac{1}{3}"]
>>> completions = [
...     [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{3}}"}],
...     [{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{2}}"}],
... ]
>>> accuracy_reward(completions, solutions)
[1.0, 0.0]
```
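
Continuing the example, the `None` skip behavior described above can be checked directly. This is a hedged illustration that assumes the plain-text gold below is rejected by the math parser:

```python
>>> completions = [[{"role": "assistant", "content": r"My answer is \boxed{\frac{1}{3}}"}]]
>>> accuracy_reward(completions, ["not a parseable expression"])  # gold is not parseable
[None]
```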

**Parameters:**

completions (`list[list[dict[str, str]]]`) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key `"content"` with the value being the text of the completion.

solution (`list[str]`) : List of the raw-text solutions to the questions/problems/prompts.

log_extra (`callable`, *optional*) : Callable to log extra columns to the completions table, provided automatically by the trainer. Defaults to `None` to allow calling the function directly outside of a trainer (e.g., for testing).

`**kwargs` : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like [GRPOTrainer](/docs/trl/main/en/gspo_token#trl.GRPOTrainer).

## reasoning_accuracy_reward[[trl.rewards.reasoning_accuracy_reward]]

#### trl.rewards.reasoning_accuracy_reward[[trl.rewards.reasoning_accuracy_reward]]

[Source](https://github.com/huggingface/trl/blob/main/trl/rewards/accuracy_rewards.py#L117)

Reward function that removes the reasoning content and checks if the final answer matches the ground truth.
- If both gold and prediction are parseable → use math verification.
- If gold is not parseable → return `None` to skip the example.

Example:
```python
>>> from trl.rewards import reasoning_accuracy_reward

>>> reasoning_delimiters = ["</think>"]
>>> solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
>>> completions = [
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
...         }
...     ],
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
...         }
...     ],
...     [
...         {
...             "role": "assistant",
...             "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
...         }
...     ],
... ]
>>> reasoning_accuracy_reward(completions, solutions, reasoning_delimiters=reasoning_delimiters)
[1.0, 0.0, 0.0]
```

**Parameters:**

completions (`list[list[dict[str, str]]]`) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key `"content"` with the value being the text of the completion.

solution (`list[str]`) : List of the raw-text solutions to the questions/problems/prompts.

reasoning_delimiters (`list[str]`, *optional*) : List of strings indicating where the reasoning content ends. The final answer is assumed to be after the last occurrence of any of these delimiters. If `None`, defaults to `["</think>"]`.

log_extra (`callable`, *optional*) : Callable to log extra columns to the completions table, provided automatically by the trainer. Defaults to `None` to allow calling the function directly outside of a trainer (e.g., for testing).

`**kwargs` : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like [GRPOTrainer](/docs/trl/main/en/gspo_token#trl.GRPOTrainer).
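
As a sketch of the `reasoning_delimiters` parameter, the snippet below assumes a hypothetical model that ends its reasoning with the string `"Final answer:"` rather than `"</think>"`:

```python
>>> completions = [
...     [{"role": "assistant", "content": r"Some reasoning. Final answer: \boxed{\frac{1}{3}}"}],
... ]
>>> reasoning_accuracy_reward(completions, [r"\frac{1}{3}"], reasoning_delimiters=["Final answer:"])
[1.0]
```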

## think_format_reward[[trl.rewards.think_format_reward]]

#### trl.rewards.think_format_reward[[trl.rewards.think_format_reward]]

[Source](https://github.com/huggingface/trl/blob/main/trl/rewards/format_rewards.py#L18)

Reward function that checks if the reasoning process is enclosed within `"<think>"` and `"</think>"` tags. The function returns a reward of 1.0 if the format is correct, otherwise 0.0.

Example:
```python
>>> from trl.rewards import think_format_reward

>>> completions = [
...     [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
...     [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
... ]
>>> think_format_reward(completions)
[1.0, 0.0]
```

**Parameters:**

completions (`list[list[dict[str, str]]]`) : List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key `"content"` with the value being the text of the completion.

`**kwargs` : Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like [GRPOTrainer](/docs/trl/main/en/gspo_token#trl.GRPOTrainer).

**Returns:**

`list[float]`

A list of rewards, where each reward is 1.0 if the completion matches the expected format, otherwise 0.0.
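
When a format reward is combined with an accuracy reward in a trainer, their relative contributions can be tuned. A hedged sketch using GRPOConfig's `reward_weights` option, which aligns positionally with the `reward_funcs` list passed to the trainer:

```python
from trl import GRPOConfig

# Weight the format reward lower than the accuracy reward (values are illustrative).
args = GRPOConfig(output_dir="grpo-rewards", reward_weights=[0.2, 1.0])
```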

## get_soft_overlong_punishment[[trl.rewards.get_soft_overlong_punishment]]

#### trl.rewards.get_soft_overlong_punishment[[trl.rewards.get_soft_overlong_punishment]]

[Source](https://github.com/huggingface/trl/blob/main/trl/rewards/other_rewards.py#L18)

Reward function that penalizes overlong completions without rewarding shorter ones. Reference: Eq. (13) from the DAPO paper (https://huggingface.co/papers/2503.14476)

$$
R_{\text{length}}(y) = \begin{cases}
0, & |y| \le L_{\max} - L_{\text{cache}} \\
\dfrac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}}, & L_{\max} - L_{\text{cache}} < |y| \le L_{\max} \\
-1, & L_{\max} < |y|
\end{cases}
$$

Example:
```python
from trl.rewards import get_soft_overlong_punishment

soft_overlong_punishment = get_soft_overlong_punishment(max_completion_len=100, soft_punish_cache=20)
completion_ids = [[1] * 90]  # simulate a completion of 90 tokens, inside the soft interval (80, 100]
rewards = soft_overlong_punishment(completion_ids)
print(rewards)  # [-0.5], i.e. (80 - 90) / 20
```
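
For completeness, the other two regimes of the equation above can be exercised the same way; a sketch reusing `soft_overlong_punishment` from the previous example:

```python
completion_ids = [[1] * 50, [1] * 110]  # 50 <= 80 tokens: no penalty; 110 > 100: full penalty
print(soft_overlong_punishment(completion_ids))  # [0.0, -1.0]
```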

**Parameters:**

max_completion_len (`int`) : Maximum length of the completion, \( L_{\max} \).

soft_punish_cache (`int`) : Length of the soft punishment interval, \( L_{\text{cache}} \). Completions longer than \( L_{\max} - L_{\text{cache}} \) are progressively penalized. If set to `0`, no soft punishment interval is applied.

