Write Harbor task verifiers using Reward Kit. Use when creating or editing a
Install via CLI
openskills install harbor-framework/harbor---
name: rewardkit
description: Write Harbor task verifiers using Reward Kit. Use when creating or editing a
task's tests/ directory, adding grading criteria, setting up LLM/agent judges, or designing
verifiers that produce a reward score.
---
Help the user write task verifiers with Reward Kit. Reward Kit is a lightweight Python
package that turns a directory of criteria files into a reward score. Each criterion is a
Python function call or a TOML judge file; folders become separate rewards.
## Setup in a Harbor task
Put criteria alongside `test.sh` in the task's `tests/` directory:
```
tests/
├── test.sh
├── checks.py # programmatic criteria
└── judge.toml # optional LLM/agent judge
```
`tests/test.sh`:
```bash
#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests
```
This runs all criteria in `/tests/` against the workspace at `/app` and writes
`/logs/verifier/reward.json`. Defaults match Harbor's conventions — no extra config needed.
If judge criteria need API keys, pass them through `task.toml`:
```toml
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```
Ask whether Reward Kit should run in the agent's shared environment or in a
separate verifier environment. Prefer a separate verifier environment when judge
prompts, grading dependencies, API keys, or clean-room checks should not be
available to the agent:
```toml
[verifier]
environment_mode = "separate"
[verifier.environment]
docker_image = "python:3.12-slim"
allow_internet = true
```
In separate mode, `tests/` is the verifier image build context and must provide
`/tests/test.sh` at runtime; Harbor does not upload `tests/` into the running
verifier container.
## Programmatic criteria
Call built-ins from any `.py` file in `tests/`:
```python
import rewardkit as rk
rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py", weight=2.0)
rk.json_key_equals("result.json", "status", "ok")
```
All criteria accept `weight` (default `1.0`) and `isolated` (default `False`, runs in
overlayfs so side effects don't leak).
### Available built-ins
- **Files**: `file_exists`, `file_not_exists`, `file_contains`, `file_contains_regex`,
`file_matches`, `files_equal`, `diff_ratio`
- **Commands**: `command_succeeds`, `command_output_contains`, `command_output_matches`,
`command_output_matches_regex` (30s default timeout, optional `cwd`)
- **Data**: `json_key_equals`, `json_path_equals`, `csv_cell_equals`, `xlsx_cell_equals`
(needs `[office]` extra), `sqlite_query_equals`
- **HTTP**: `http_status_equals`, `http_response_contains`
- **Images**: `image_similarity`, `image_size_equals` (needs `[image]` extra)
- **Trajectory**: `trajectory_tool_used`, `trajectory_tool_not_used`, `trajectory_turn_count`
For extras, install with `uv tool install harbor-rewardkit[all]`.
## Custom criteria
Use the `@criterion` decorator. First parameter is always `workspace: Path`. Returns
`bool` or `float`:
```python
from pathlib import Path
from rewardkit import criterion
@criterion
def has_valid_output(workspace: Path) -> bool:
return (workspace / "output.txt").read_text().strip() != ""
```
Zero-parameter criteria auto-register. Criteria with extra args must be called via `rk`:
```python
@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
return len((workspace / "output.txt").read_text().splitlines()) >= n
rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)
```
For criteria shared across reward subdirs, define with `shared=True` in a root-level file
and call from subdirs.
## Judge criteria (LLM or agent-as-a-judge)
For subjective checks (quality, readability, edge cases), create a TOML file:
```toml
[judge]
judge = "anthropic/claude-sonnet-4-6" # LiteLLM model string
files = ["/app/main.py"]
[[criterion]]
description = "Is the code correct?"
type = "binary"
[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0
```
Criterion types:
- `binary` — yes/no → 1.0 or 0.0
- `likert` — 1..points, normalized to [0, 1]
- `numeric` — min..max, normalized to [0, 1]
### Agent judges
Agent judges shell out to a CLI and can explore the filesystem:
```toml
[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true
[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"
```
Slower and more expensive than LLM judges, but they can run commands and inspect files.
### Useful `[judge]` options
`timeout` (default 300), `reasoning_effort` (`low`|`medium`|`high`), `reference` (path to
reference solution), `atif-trajectory` (evaluate the agent's trajectory), `weight`,
`prompt_template` (custom prompt with `{criteria}` placeholder).
### Scoring aggregation
```toml
[scoring]
aggregation = "all_pass" # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7 # only for threshold
```
Only affects aggregation *within* this TOML file.
## Multi-reward tasks
Put criteria in subdirectories — each becomes a separate reward:
```
tests/
├── test.sh
├── correctness/
│ └── check.py
├── structure/
│ └── files_exist.py
└── quality/
└── quality.toml
```
Produces:
```json
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }
```
## Output files
- `/logs/verifier/reward.json` — per-reward scores
- `/logs/verifier/reward-details.json` — per-criterion results, judge reasoning, errors
## Multi-step tasks
In a multi-step task, each step has its own `tests/` under
`steps/{name}/tests/`, and the verifier runs once per step. Reward Kit behaves
the same as in a single-step task: for each step it reads `/tests`, runs the
criteria against `/app`, and writes `/logs/verifier/reward.json` for that step.
Harbor then aggregates per-step results into a trial-level reward via
`multi_step_reward_strategy` in `task.toml` — aggregation happens *outside*
Reward Kit, so don't try to encode cross-step logic in your criteria.
A task-level `tests/` directory (at the task root) is uploaded to `/tests`
first, then the step's own `tests/` is layered on top (same-name files win).
Put shared helpers (common `checks.py` functions with `shared=True`, fixture
files, a fallback `test.sh`) at the task level, and step-specific criteria
under each step.
Multi-reward subdirectories still work *within* a step: `steps/foo/tests/`
can contain `correctness/`, `structure/`, `quality/` — each produces a
separate reward key for that step, and `multi_step_reward_strategy = "mean"`
averages each key across steps. Use `"final"` when the last step is an
end-to-end check whose rewards already represent the full task.
## When to reach for what
- **Use built-ins** for file existence, string matches, command output, JSON/CSV checks,
HTTP probes.
- **Use `@criterion`** when logic is task-specific but still programmatic.
- **Use LLM judges** for subjective quality dimensions (readability, correctness of prose).
- **Use agent judges** when the rubric requires exploring the filesystem or running code
(e.g. "does the test suite actually pass?").
- **Use subdirectories** when you want separate scores (correctness vs structure vs
quality) rather than one blended number.
- **Use `isolated=True`** for any criterion that runs mutating commands, so it doesn't
corrupt the workspace for other criteria.
## Working example
See `examples/tasks/reward-kit-example/` in the Harbor repo.
No comments yet. Be the first to comment!