Authoring native task.md tasks
A native BenchFlow task is one task.md document plus sidecar directories.
The YAML frontmatter carries the task configuration; the markdown body is
the prompt. This page teaches the native format hands-on. For the normative
standard see the task standard.
When a directory contains both task.md and a legacy task.toml +
instruction.md pair, task.md is the authoritative task definition: the
runtime selects it and ignores the split pair.
Minimal task — three files
A minimal task consists of task.md, an environment/ directory with a
Dockerfile, and a verifier/ directory with a runnable entrypoint. An oracle/
directory is optional.
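As a rough illustration of that layout, the following sketch writes the three files into a fresh directory. The prompt text, the 600-second timeout, and the always-pass verifier are placeholders chosen for this example, and scaffold_minimal_task is a hypothetical helper, not part of BenchFlow; the result has not been checked against bench tasks check.

```python
from pathlib import Path
import stat

# Frontmatter uses the `name` shorthand and `agent.timeout_sec` described on
# this page; the prompt text and timeout value are made-up placeholders.
TASK_MD = """\
---
name: hello-world
agent:
  timeout_sec: 600
---
Create a file named hello.txt in the working directory containing "hello".
"""

DOCKERFILE = "FROM ubuntu:24.04\n"

# Placeholder verifier: always awards full reward. A real test.sh would
# inspect the agent's work before choosing the score.
TEST_SH = """\
#!/bin/sh
mkdir -p /logs/verifier
echo 1.0 > /logs/verifier/reward.txt
exit 0
"""


def scaffold_minimal_task(root: Path) -> Path:
    """Write the three-file minimal layout under root and return root."""
    (root / "environment").mkdir(parents=True, exist_ok=True)
    (root / "verifier").mkdir(parents=True, exist_ok=True)
    (root / "task.md").write_text(TASK_MD)
    (root / "environment" / "Dockerfile").write_text(DOCKERFILE)
    test_sh = root / "verifier" / "test.sh"
    test_sh.write_text(TEST_SH)
    # Mark the entrypoint executable so it is runnable as-is.
    test_sh.chmod(test_sh.stat().st_mode | stat.S_IXUSR)
    return root
```

Calling scaffold_minimal_task(Path("hello-world")) produces hello-world/task.md, hello-world/environment/Dockerfile, and hello-world/verifier/test.sh.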
Frontmatter
task.md must start with a YAML frontmatter block delimited by --- lines, and
the frontmatter must be a mapping; a document without it fails to parse. The
keys fall into three classes.
Task config keys are the BenchFlow task config surface, validated as
TaskConfig. Unknown keys are rejected (the schema is extra="forbid"),
so typos fail at parse time instead of becoming silently-ignored config:
| Key | Meaning |
|---|---|
| schema_version (alias version) | Config schema version, currently "1.3" |
| task | Package identity: name (org/name format), description, authors, keywords |
| metadata | Freeform mapping: difficulty, category, tags, anything descriptive |
| agent | Agent run policy: timeout_sec, user, network_mode, allowed_hosts |
| verifier | Verifier run policy: timeout_sec (default 600), env, user, service, … |
| environment | Sandbox: docker_image, cpus, memory_mb, storage_mb, network_mode, env, workdir, … |
| oracle | Oracle run policy: env, timeout_sec (import alias: solution) |
| source, artifacts, steps, multi_step_reward_strategy, reward | Provenance, artifact, and reward metadata |
agent.timeout_sec is strongly recommended: it is optional and defaults
to unset, and a task that omits it runs the agent with no wall-clock cap
unless the caller supplies a per-run timeout. Set it on every published task.
Declaring both oracle and the legacy solution alias in one config is
invalid and rejected; native tasks use oracle.
Document orchestration keys are parsed by TaskDocument, not TaskConfig:
agents (named roles with agent, model, reasoning_effort,
capabilities, …), scenes (ordered turns referencing declared roles — a
turn that names an undeclared role is a parse error), and user (simulated
user). benchflow is the reserved extension namespace.
Authoring shorthands are expanded during parsing and never reach the
canonical config under their short names:
| Shorthand | Expands to |
|---|---|
| name: hello-world | task.name: benchflow/hello-world (a / in the value keeps your org) |
| image: ubuntu:24.04 | environment.docker_image: ubuntu:24.04 |
| verifier: verifier/ (string form) | benchflow.verifier.path / .spec / .entrypoint defaults |
| oracle: oracle/ (string form) | benchflow.oracle.path |
| profile: code-change | Merges a named defaults bundle (see below) |
Profiles (profile: / profiles:) merge predefined default bundles
(code-change, harbor-compatible, reward-kit, acceptance-live, multi-agent,
leaderboard-local) under your explicit keys; an unknown profile name is a
parse error. bench tasks normalize <task-dir> prints the fully expanded
canonical document (--write replaces task.md in place), so a minimal authored
file and its canonical form never drift apart.
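The name and image rows of the shorthand table can be sketched as a small transformation over the parsed frontmatter. This is an illustrative sketch only, not BenchFlow's parser: expand_shorthands and DEFAULT_ORG are hypothetical names, and the remaining shorthands (verifier, oracle, profile) are left out.

```python
from copy import deepcopy

DEFAULT_ORG = "benchflow"  # org prefix implied by the table's example


def expand_shorthands(frontmatter: dict) -> dict:
    """Return a copy with `name` and `image` moved to their canonical keys."""
    out = deepcopy(frontmatter)
    if "name" in out:
        name = out.pop("name")
        # A "/" in the value means the author already supplied an org.
        full = name if "/" in name else f"{DEFAULT_ORG}/{name}"
        # Assumption: the shorthand overwrites any explicit task.name; how
        # the real parser resolves that conflict is not covered here.
        out.setdefault("task", {})["name"] = full
    if "image" in out:
        out.setdefault("environment", {})["docker_image"] = out.pop("image")
    return out
```

For example, expand_shorthands({"name": "hello-world"}) yields a mapping whose task.name is benchflow/hello-world, while a value such as acme/demo is kept unchanged.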
Prompt body and prompts/ sidecars
The body below the frontmatter is the base prompt: free-form markdown, no
heading ceremony required. If the body contains no reserved section headings,
the entire body is the instruction the agent receives. Four reserved headings
are recognized for compatibility imports: ## prompt,
## role:<name>, ## scene:<name>, and ## user-persona. Repeating the same
section heading is a parse error. bench tasks init scaffolds a single
## prompt section as a starting point — for a single-prompt task that is
equivalent to a bare body, so keep it or drop the heading as you prefer. The
multi-prompt headings (## role:, ## scene:, ## user-persona) are for
compatibility imports only; new multi-prompt material belongs in sidecar files
under prompts/:
| File | Meaning |
|---|---|
| prompts/role.<name>.md | Role prompt: the whole file body is the prompt text |
| prompts/scene.<name>.md | Scene prompt |
| prompts/user-persona.md | Simulated-user persona |
For example, the prompt for a role named solver lives in
prompts/role.solver.md. See docs/examples/task-md/ for runnable examples,
including real converted SkillsBench packages.
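The body rules above (a bare body is the whole prompt, reserved headings split it into sections, and a repeated heading is an error) can be approximated as follows. This is a sketch under assumptions, not the BenchFlow parser: split_body is a hypothetical helper, a <name> is taken to be any run of non-space characters, and text before the first reserved heading is simply dropped.

```python
import re

# The four reserved headings listed above; <name> is assumed to be any run
# of non-space characters.
RESERVED = re.compile(
    r"^## (prompt|role:\S+|scene:\S+|user-persona)[ \t]*$", re.MULTILINE
)


def split_body(body: str) -> dict:
    """Map each reserved heading to its text, or {"prompt": body} if none."""
    matches = list(RESERVED.finditer(body))
    if not matches:
        # No reserved headings: the entire body is the instruction.
        return {"prompt": body.strip()}
    sections = {}
    for i, m in enumerate(matches):
        key = m.group(1)
        if key in sections:
            raise ValueError(f"duplicate section heading: ## {key}")
        # A section runs until the next reserved heading or the end.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        sections[key] = body[m.end():end].strip()
    return sections
```

With this sketch, a body that contains only a ## prompt section and a bare body with the same text produce the same single-entry result, which matches the equivalence described for bench tasks init scaffolds.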
Verifier package and strategy declaration
The native verifier directory is verifier/. At verify time the directory is
uploaded into the sandbox at /verifier, and the verifier must write its
reward to /logs/verifier/reward.txt (and optionally
/logs/verifier/reward.json).
A plain verifier/test.sh is a complete verifier: with no other declaration,
the runtime executes it directly. Write a float 0.0–1.0 to
/logs/verifier/reward.txt, then exit 0; a nonzero exit means verifier
infrastructure failure, not a scored task failure.
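If test.sh delegates scoring to a Python script, the reward contract might be sketched like this. write_reward is a hypothetical helper and the log_dir parameter exists only so the function can be exercised outside a sandbox; the default path is the one stated above.

```python
from pathlib import Path


def write_reward(score: float, log_dir: str = "/logs/verifier") -> Path:
    """Write score to <log_dir>/reward.txt and return the file path."""
    # Reject out-of-range values (and NaN) instead of writing a bad reward.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"reward must be within [0.0, 1.0], got {score}")
    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    reward_file = out_dir / "reward.txt"
    reward_file.write_text(f"{score}\n")
    return reward_file
```

A script using this helper should still finish with exit status 0 after writing the reward, including when the score is 0.0, so that a failed task is not reported as a verifier infrastructure failure.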
To declare how the task is scored, add verifier/verifier.md. Its
frontmatter must contain a verifier: mapping with at least one entry under
strategies; default_strategy selects which one runs (it defaults to the
first declared strategy and must name a declared one):
| type | Required config | Notes |
|---|---|---|
| script | command | Runs as cd /verifier && <command>; local script files named in the command must exist in verifier/ |
| llm-judge | rubric | Optional model, input_dir, and context or context_file (not both) |
| reward-kit | root | Optional entrypoint (default reward.py) and criteria; paths must be safe-relative |
| agent-judge | role, isolation: verifier-only, inputs | role must match a ## role:<name> section in the verifier.md body |
| ors-episode | inputs | Optional format: json, jsonl, or auto |
An unknown type is a parse error. bench tasks check also verifies the
selected strategy is actually runnable — e.g. a script strategy whose
referenced files are missing, or an llm-judge strategy whose rubric file
does not exist, fails validation.
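The selection rules for default_strategy can be sketched as below. The shape of the data is an assumption made for illustration: strategies is treated as a mapping from strategy name to its config, which this page does not confirm, and select_strategy is a hypothetical helper rather than BenchFlow code.

```python
def select_strategy(verifier: dict) -> tuple[str, dict]:
    """Return (name, config) of the strategy that would run."""
    # Assumed shape: {"strategies": {name: {"type": ..., ...}}, ...}
    strategies = verifier.get("strategies") or {}
    if not strategies:
        raise ValueError("verifier.strategies needs at least one entry")
    # dicts keep insertion order, so the first key is the first declared.
    name = verifier.get("default_strategy", next(iter(strategies)))
    if name not in strategies:
        raise ValueError(f"default_strategy {name!r} is not declared")
    config = strategies[name]
    # llm-judge accepts context or context_file, never both.
    if (
        config.get("type") == "llm-judge"
        and "context" in config
        and "context_file" in config
    ):
        raise ValueError("llm-judge takes context or context_file, not both")
    return name, config
```

The sketch covers only selection and the context/context_file exclusivity rule; the runnability checks that bench tasks check performs (for example, that referenced script or rubric files exist) are not modeled.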
outputs declares the reward artifact contract (defaults shown above;
details_json and aggregate_policy are optional). bench tasks check --level publication-grade additionally requires the native package shape:
task.md, native oracle/, verifier/verifier.md with rubric files, and an
explicit reward_json output contract.
Oracle
oracle/solve.sh is the held-out reference solution (solution/ is the
legacy alias; oracle/ wins when both exist). Native oracles are uploaded to
/oracle in the sandbox (legacy solution/ to /solution) and run instead
of an agent with --agent oracle. A task should score 1.0 on its oracle run
before any model sees it.
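The directory precedence and sandbox paths described above amount to a short lookup. The following is a minimal sketch with a hypothetical helper name, resolve_oracle, not the runtime's own code.

```python
from pathlib import Path
from typing import Optional


def resolve_oracle(task_dir: Path) -> Optional[tuple[Path, str]]:
    """Return (local directory, sandbox path) for the reference solution."""
    # oracle/ is checked first, so it wins when both directories exist.
    for local_name, sandbox_path in (("oracle", "/oracle"), ("solution", "/solution")):
        candidate = task_dir / local_name
        if candidate.is_dir():
            return candidate, sandbox_path
    return None  # neither directory exists; the oracle is optional
```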
Multi-container tasks
A task may ship an environment/docker-compose.yaml alongside the
Dockerfile. The agent always runs in the main service; any additional
services you declare become sibling containers on the same Docker network.
This supports vulhub-style CVE tasks where the agent attacks a separate target
container over the network.
environment/Dockerfile is always required: bench tasks check rejects a task
that ships only a docker-compose.yaml. If your main service uses a prebuilt
image: and needs no build context, still include a minimal Dockerfile (e.g.
FROM <same-image>) so structural validation and other tooling agree on the
task package shape.
When the compose file declares a sibling service named target, main reaches
target by service name (http://target:8080). The verifier
can inspect target-side state — not just the agent’s workspace — by passing a
service argument when running commands:
exec(..., service=...) works on the Docker sandbox and the Daytona DinD
(compose) sandbox. Single-container backends (Modal, direct Daytona) raise a
clear error for any non-main service. This lets a verifier check write-based
oracles (/tmp/exploit.txt in the target), database modifications, or RCE
markers without trusting the agent container.
Target-side verifier with verifier.service
For tasks whose success oracle lives in a target container — an RCE marker
file, a modified database row — point the verifier/test.sh verifier at that
service with the service key under verifier in the frontmatter. The runtime
then uploads the verifier/ directory into the
target container, runs test.sh there, and copies the resulting
reward.txt / reward.json back to the host. service defaults to "main"
(the agent container), so single-container tasks are unaffected.
verifier.service is the declarative, task-schema way to do cross-container
verification; the env.exec_in_service(...) Python API above is the imperative
equivalent for hook-driven runs.
Use the same service name you declared in docker-compose.yaml. A test.sh
running in the target reaches main (and vice versa) by service name over the
Docker network, just like the agent does.
Hardening policy for multi-container tasks
BenchFlow’s pre-verification hardening (killing the sandbox user’s processes,
scrubbing PATH/PYTHONPATH, restoring build-config files) applies only to
the main (agent) container. Target containers are deliberately left
unhardened: a vulhub-style target is meant to be vulnerable, the agent never
has a shell inside it, and hardening it would risk breaking the very
vulnerability the task exercises. verifier.service selects where test.sh
runs; it does not move hardening off main.
Migrating a legacy task
bench tasks migrate converts a task.toml + instruction.md pair into
task.md.
task.toml keys that the schema does not model are preserved under
benchflow.compat in the generated frontmatter rather than dropped. After
migrating, validate the result with bench tasks check.
Compatibility export
To produce a compatibility split package from a task.md package, use
bench tasks export. The export writes
compatibility/export-report.json so you can see what (if anything) the
split layout cannot represent. Publication-grade validation requires
task.md to be the only authoritative entrypoint, so keep exported split
layouts in a separate output directory rather than beside task.md. See
CLI reference: bench tasks export
for all flags.