> ## Documentation Index
> Fetch the complete documentation index at: https://benchflow-bff148e7-codex-separate-integration-eval-env.mintlify.app/llms.txt
> Use this file to discover all available pages before exploring further.

# Task standard

# BenchFlow Task Package Standard

Status: v0.6.2 — current stable task package standard (2026-06-14)

This document defines the direction for BenchFlow-native task packages. The
short version: `task.md` is the native authoring entrypoint; `oracle/` and
`verifier/` are the BenchFlow-native names for held-out reference behavior and
reward checks. Split-layout names such as `solution/` and `tests/` remain
compatibility names for migration and export only.

The standard is intentionally split into three views:

| View                 | Purpose                                                                 | Owner                                               |
| -------------------- | ----------------------------------------------------------------------- | --------------------------------------------------- |
| Authoring document   | What humans and generators write: one `task.md` plus sidecar dirs       | `TaskDocument`                                      |
| Runtime task view    | What rollout, verifier, hardening, provenance, and trajectories consume | `TaskRuntimeView` plus first `TaskPackage` boundary |
| Foreign adapter view | What external benchmark formats and hosted environments import/export   | adapters                                            |

Do not treat these as the same interface. A good authoring document can include
more information than a foreign format can export, and a foreign import can
preserve unknown data that native authoring would reject.

## Goals and scope

A BenchFlow task is one `task.md` that selects a mode on each of three planes
(see *Planes* below): how the environment is built, how the agent interacts, and
how the result is scored. The standard has three goals:

1. **Native authoring** — humans and generators write one `task.md` (plus sidecar
   dirs) as the primary surface.
2. **Interoperability** — existing split-layout formats (`task.toml` +
   `instruction.md` + `solution/` + `tests/`) import directly, and packages export
   back to that layout with an explicit, honest loss report when a native concept
   has no equivalent.
3. **Coverage** — the schema can express the full range of eval shapes (single-shot,
   multi-round, simulated-user, multi-agent, and live-arena interaction; workspace,
   trajectory, rubric, judge, and leaderboard reward), even where the runtime for a
   given mode lands in a later milestone (see *Open Primitives and Roadmap*).

Native authoring is the priority surface; import is best-effort-faithful; export is
best-effort with a loss report. The standard does not require bidirectional-lossless
round-tripping with any one external format.

## Field discipline

Every normative standard field must map to a mode on one of the three planes
(below) or to a concrete BenchFlow runtime need. Fields without that mapping
belong under `metadata` or in an adapter, not in the standard.

## Native Layout

```text theme={null}
task/
|-- task.md
|-- environment/
|   `-- Dockerfile
|-- verifier/
|   |-- verifier.md
|   `-- test.sh
|-- oracle/
|   `-- solve.sh
`-- evidence/
    `-- validation.json
```

Compatibility aliases:

| Native      | Compatibility / foreign export | Meaning                                                           |
| ----------- | ------------------------------ | ----------------------------------------------------------------- |
| `task.md`   | `task.toml` + `instruction.md` | Task config plus prompt material                                  |
| `oracle/`   | `solution/`                    | Held-out reference implementation                                 |
| `verifier/` | `tests/`                       | Verifier package, reward code, hidden checks, rubrics, and judges |

Target validation: native packages may carry both native and compatibility
directories only when hashes prove the duplicate content is equivalent. Current
runtime prefers the native spelling when both aliases exist; it does not yet
prove equivalence.

Within a BenchFlow-native package, selection is fail-closed:

* if `task.md` exists, it is the authoritative task definition
* if `verifier/` exists, it is the authoritative verifier directory; an empty
  or invalid `verifier/` does not fall back to `tests/`
* if `oracle/` exists, it is the authoritative oracle directory; an empty or
  invalid `oracle/` does not fall back to `solution/`
* duplicate alias trees must be byte-identical after normalized traversal or
  validation should report a collision

As of v0.6.2, split layouts are compatibility inputs and export artifacts, not
the native authoring surface. New BenchFlow tasks should publish `task.md` as
the only authoritative entrypoint.

## Versioning

There are two versions:

```yaml theme={null}
schema_version: "1.3"        # BenchFlow task config surface
benchflow:
  document_version: "0.6"    # BenchFlow task.md document syntax
```

`schema_version` is for the runtime config model shared with compatibility
imports. `benchflow.document_version` is for document-only concepts such as
teams, prompt composition, agent policy, runtime policy, private assets,
provenance, evidence, nudges, and export policy.

## Root Frontmatter

The root frontmatter has three classes of keys.

BenchFlow config keys are modeled by `TaskConfig` and must be rejected
when unknown in native authoring mode:

* `schema_version` / `version`
* `task`
* `metadata`
* `agent`
* `verifier`
* `environment`
* `oracle` (validation alias: `solution`)
* `source`
* `artifacts`
* `steps`
* `multi_step_reward_strategy`

Document orchestration keys are parsed by `TaskDocument`:

* `agents`
* `scenes`
* `user`

BenchFlow extension keys live under the reserved namespace:

* `benchflow`

Do not add new root keys for every new idea. Put draft or BenchFlow-specific
extensions under `benchflow:` until they have a stable interface.

Native authoring should use `oracle`. Importers may accept the compatibility
`solution` alias, but a config that contains both names is invalid.

## Prompt Body

The `task.md` body **is** the base prompt — free-form markdown, exactly like a
`SKILL.md` body. No `## prompt` heading is required: if the body carries no
reserved section headings, the entire body is the prompt. The common single-shot
task is just frontmatter plus prose, so a bespoke benchmark ports by dropping its
existing instruction text in as the body with no markup.

Tasks that need more than a base prompt — multiple roles, multiple scenes, or a
simulated-user persona — author each as its **own** free-form file under
`prompts/`, so no body ever carries reserved-heading ceremony:

| File                      | Meaning                |
| ------------------------- | ---------------------- |
| `prompts/role.<name>.md`  | Role prompt            |
| `prompts/scene.<name>.md` | Scene prompt           |
| `prompts/user-persona.md` | Simulated user persona |

Each sidecar file is itself a clean free-form body. For backward compatibility,
single-prompt source formats may instead embed the same content in the body via
reserved `## prompt`, `## role:<name>`, `## scene:<name>`, and `## user-persona`
headings; these import losslessly and normalize to the file layout. Sidecar files
take precedence over a heading of the same name. New tasks should prefer the files.

Default runtime precedence remains a simple fallback:

1. inline turn prompt
2. scene prompt
3. role prompt
4. base prompt

Native tasks can make composition explicit:

```yaml theme={null}
benchflow:
  prompt:
    composition: append
    order: [base, role, scene, turn]
```

The first `TaskPackage` prompt plan now compiles `append` and explicit `replace`
policies deterministically. That avoids losing role guardrails when a scene
prompt is present. `RolloutConfig` consumes the compiled plan for task.md scene
execution when explicit CLI/SDK prompts do not override the document.

## Agent And Runtime Policy

Agent isolation is distinct from task environment networking. A no-search task
can still need dependency, LLM, or provider egress outside the sandbox; a
no-network task can still need the agent harness to call its model.

```yaml theme={null}
benchflow:
  agent_policy:
    skill_access: none          # none | installed | declared
    allowed_skill_roots: []
    search: disabled            # allowed | disabled
    forbidden_tools: [web_search, browser_fetch]
    trajectory_audit:
      require_no_forbidden_tool_calls: true
  runtime_policy:
    backend: modal              # docker | daytona | modal | kubernetes | podman | hpc | queue
    required_capabilities: [gpu:B200, private_mounts, persistent_state]
    network:
      task_default: no-network
      agent_egress: model-and-dependencies
      allowed_hosts: []
      phase_overrides:
        verifier: no-network
    private_mounts:
      - source: modal-volume://org-models/qwen
        target: /mnt/models/qwen
        mode: ro
        visibility: agent
        secret_ref: org-models-readonly
    persistent_state:
      required: true
      scope: task-run
      cleanup: after-verifier
```

Unsupported `agent_policy`, `runtime_policy`, private mounts, registry secrets,
GPU types, phase-specific network overrides, or persistent state semantics must
fail closed before launch for the selected sandbox. No-search proof must come
from both launch policy and trajectory audit; it is not implied by
`network_mode: no-network`.

## Planes

The package has six planes. Keeping them separate is the core abstraction.

| Plane       | Owns                                                                                                                         | Should not own       |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------- | -------------------- |
| Package     | identity, versions, provenance, source hashes, export policy                                                                 | sandbox execution    |
| Runtime     | environment image, resources, network policy, setup/reset/readiness, mounts                                                  | prompt orchestration |
| Interaction | agents, roles, teams, scenes, turns, user/nudge loops, handoff policy                                                        | verifier checkpoints |
| Verifier    | verifier package entrypoint, reward file contract, hidden fixtures, scorer type, rubrics, judge roles, separate verifier env | agent prompt text    |
| Oracle      | held-out reference implementation and oracle-only env                                                                        | tests/verifier code  |
| Evidence    | validation runs, flake data, anti-cheat review, artifact hashes, leaderboard metadata                                        | mutable task source  |

Imported `steps` are verifier/runtime checkpoints. BenchFlow `scenes` are
interaction checkpoints. They are orthogonal. A task can have both, but a
runtime must define how they compose before executing them.

Of the six planes, three are the **authoring axes** an author selects a mode from
— Runtime (the environment), Interaction, and Verifier — while the remaining three
(Package, Oracle, Evidence) are supporting planes. A task is one mode-selection
per axis: `environment` × `interaction-mode` × `verifier-strategy`. New normative
fields must add or compose a mode on one of these axes; otherwise they belong in
`metadata` or an adapter.

## Presets And Profiles

Two different reuse mechanisms were historically both called "profile," which is
why the mental model felt overloaded. v0.6 separates them:

* **`preset`** — *authoring sugar*. A named bundle of defaults that `bench tasks
  normalize` expands into the canonical contract and then **discards**; it has no
  runtime existence. Presets are the lightweight-authoring path: a tiny `task.md`
  becomes a full contract. Examples: `code-change`, `acceptance-live`,
  `harbor-compatible`. (The current implementation still spells the preset key as
  `profile:`; M0 renames it to `preset:` and keeps `profile:` as a deprecated
  alias.)
* **`*-profile`** — *runtime-real shared config*. A benchmark-level object that
  many tasks **reference** and that **persists into `TaskRuntimeView`**, one per
  authoring axis:
  * `environment-profile` — a shared world reused across tasks (e.g. one
    clawsbench service catalog referenced by all 44 tasks)
  * `agent-profile` — shared harness / model / role wiring
  * `verifier-profile` — shared reward strategy and rubric

Rule of thumb: if removing it changes only how much you typed, it is a `preset`;
if removing it changes what runs, it is a `*-profile`. Presets normalize away;
profiles are inherited. `*-profile` resolution is M1 runtime work; `preset` exists
today.

## Verifier Package

The verifier is a peer package, not just a `tests/` directory. Native verifier
packages should have their own entry document:

```text theme={null}
verifier/
|-- verifier.md
|-- task.toml
|-- test.sh
|-- rewards/
|   |-- correctness/
|   |   `-- reward.py
|   `-- quality/
|       `-- criteria.toml
|-- rubrics/
|   |-- verifier.md
|   `-- verifier.toml
|-- judges/
|   `-- reviewer.md
`-- fixtures/
    `-- hidden_cases.jsonl
```

`verifier/verifier.md` is analogous to `task.md` for the evaluation side. It
describes what evidence is read, which strategies can score the task, how
rubric dimensions compose, which judge agents are allowed, and what outputs the
runtime must preserve.

`verifier/task.toml` is an optional compatibility projection of
`verifier/verifier.md`; it is not a second native surface. If both files exist,
their canonical projection must match or validation fails closed.

```md theme={null}
---
document_version: "0.3"
verifier:
  name: hidden-patch-and-quality
  default_strategy: deterministic
  strategies:
    deterministic:
      type: script
      command: ./test.sh
    rewardkit:
      type: reward-kit
      root: rewards/
      criteria: rewards/quality/criteria.toml
    judge:
      type: agent-judge
      role: verifier_judge
      model: gpt-5.5
      inputs: [trajectory/acp_trajectory.jsonl, /logs/artifacts/patch.diff]
      isolation: verifier-only
    ors:
      type: ors-episode
      reward_aggregation: last_non_null
      terminal_policy: finished_true
  rubric:
    combine: weighted_sum
    dimensions:
      correctness: {weight: 0.7, source: deterministic}
      maintainability: {weight: 0.2, source: judge}
      evidence_quality: {weight: 0.1, source: rewardkit}
    files:
      human: rubrics/verifier.md
      structured: rubrics/verifier.toml
  outputs:
    reward_text: /logs/verifier/reward.txt
    reward_json: /logs/verifier/reward.json
    details_json: /logs/verifier/reward-details.json
    aggregate_policy:
      field: reward
      fallback: weighted_mean
---

## role:verifier_judge

Grade only the submitted artifact and declared evidence. Do not infer intent
from private oracle files or hidden verifier fixtures.
```

`test.sh` is the minimum executable strategy and the compatibility export target.
Reward Kit-style criteria, ORS-style episode rewards, and AgentBeats-style
assessor agents are verifier strategies. Runtime adapters may write declared
evidence artifacts such as `trajectory/ors-rewards.jsonl`, but the ORS-specific
judge/normalization semantics stay inside verifier scope. They must not leak
into agent prompts, and they must emit either the canonical reward envelope or a
declared multi-metric map.

Verifier packages must be isolated:

* judge models and credentials are verifier-scoped, not agent-visible
* judge prompts live under `verifier/`, not inside the task prompt
* rubrics are separate from judge persona: `rubrics/verifier.md` is the human
  scoring contract, `rubrics/verifier.toml` or JSON is the structured scoring
  contract, and `## role:verifier_judge` is the assessor posture
* hidden fixtures and rubrics are mounted only during verifier execution
* agent-as-judge trajectories are recorded as evidence and audited for input
  leakage
* unsupported verifier strategies fail closed before scoring starts

## Verifier Contract

Native verifier code lives in `verifier/` and is mounted at `/verifier`.
Compatibility `tests/` remains supported and is mounted at `/tests`.

The minimum script verifier contract applies to the selected verifier
entrypoint: native packages with no `verifier/verifier.md` run
`verifier/test.sh`; imported split-layout packages may run `tests/test.sh`.
When `verifier/verifier.md` is present, structural verifier validity follows
the selected strategy. A package can therefore be valid without
`verifier/test.sh` if, for example, the selected Reward Kit runner and criteria
files exist.

* write `/logs/verifier/reward.txt` with one float from `0.0` to `1.0`
* optionally write `/logs/verifier/reward.json` with structured rubric/evidence;
  BenchFlow preserves structured rewards, keeps `reward.txt` as scalar
  compatibility, and can compute a scalar from declared aggregate policies
* prefer exit `0` after writing reward
* treat nonzero verifier exit without a fresh reward file as infrastructure
  failure; a nonzero exit with a fresh reward is a scored task result with
  verifier diagnostics

The richer standard should model verifier inputs explicitly:

```yaml theme={null}
verifier:
  type: test-script
  timeout_sec: 900
benchflow:
  verifier:
    entrypoint: verifier/test.sh
    visibility: hidden
    inputs:
      - path: verifier/test.patch
        kind: test_patch
        visibility: hidden_verifier
    outputs:
      reward_text: /logs/verifier/reward.txt
      reward_json: /logs/verifier/reward.json
    transfer:
      agent_to_verifier:
        - /logs/artifacts/**
      verifier_to_result:
        - /logs/verifier/reward.*
        - /logs/verifier/artifacts/**
```

This covers `tests/test.patch`, Windows entrypoints, artifact-only graders,
separate verifier images, and hidden fixtures without overloading the directory
name.

Separate verifier execution must define image resolution, hidden fixture
mounting, step-level verifier environments, and transfer rules. Hidden verifier
inputs such as `tests/test.patch` or `verifier/test.patch` are mounted only for
the verifier phase; agent logs and workspace files move into verifier scope
only through declared artifact transfer paths.

Target reward precedence:

* `reward.txt` remains the scalar compatibility minimum
* when `reward.json` exists, it should be the authoritative rich reward artifact
* `reward.json` may be an envelope with a numeric `reward`, or a reward-kit
  style multi-metric map with `metrics` plus a declared `aggregate` policy
* if both files exist and `reward.json` has a scalar aggregate, the scalar must
  match `float(reward.txt)` or validation should fail closed
* if `reward.json` is a multi-metric map without a scalar `reward`,
  verifier metadata may declare the aggregate policy used for scalar exports;
  current runtime computes `reward` for `mean`, `weighted_mean`, and
  `weighted_sum`, and `reward.txt` may still carry the scalar compatibility
  value
* `reward-details.json` should be preserved when present
* reward artifacts should preserve structured reserved keys such as `rubric`,
  `items`, `evidence`, `artifacts`, `metadata`, `reason`, `reasons`, `errors`,
  and task-specific payloads such as `metrics`, `regressions`, `participants`,
  `winner`, `raw`, and `debug`

Current runtime now prefers `reward.json` over `reward.txt`, rejects
disagreeing scalar outputs, and preserves a first set of structured reward
fields. `src/benchflow/task/verifier_document.py` parses
`verifier/verifier.md` strategies, rubric metadata, output contracts, and
verifier-scoped role prompts. When `verifier/verifier.md` is present,
`Verifier.verify()` selects its default strategy: `script` runs the declared
command relative to the uploaded verifier directory, `llm-judge` uses the
existing deliverables judge with verifier-local rubric, model, input directory,
and context/context-file overrides, `reward-kit` runs a safe relative
`reward.py` package runner inside verifier scope, writes a
`reward-kit-manifest.json` contract, and, when criteria are declared, parses
those criteria before launch and computes/verifies canonical `reward` from
matching `reward.json.metrics`. `agent-judge` runs a
verifier-scoped judge role over declared evidence inputs. ORS runtime helpers
can normalize tool-output rewards into `trajectory/ors-rewards.jsonl`;
`ors-episode` then reads declared ORS reward evidence, normalizes reward
responses or event streams through the existing ORS adapter, and emits
canonical `reward.json` plus `reward-details.json`.
`reward-details.json` is a named rollout artifact, stale copies are cleared
before verification, script verifiers can preserve it, target-service
verifiers download it with the rest of `/logs/verifier`, and the built-in LLM
judge emits criterion details there. Metrics-only `reward.json` maps can now
use the verifier document's `outputs.aggregate_policy` or selected Reward Kit
criteria policy to compute and persist the canonical `reward`. It still does
not fully match the target: full Reward Kit parity, full OpenReward environment
import/export, and AgentBeats
assessor lifecycles are not yet first-class; selected unsupported strategies
fail closed.

Validation evidence should include parser checks, legacy-vs-native migration
parity, live rollout artifacts, negative-contract failures, and explicit
fail-closed results for parsed fields that the selected runtime cannot honor.

## Oracle Contract

Native oracle code lives in `oracle/` and is mounted at `/oracle`.
Compatibility `solution/` remains supported and is mounted at `/solution`.

The naming change is semantic:

* `oracle` means "held-out reference behavior used to prove solvability"
* `solution` is a compatibility export name for older split layouts

Proposed outbound exporters should map native `oracle/` to foreign
`solution/`. Current runtime supports native and compatibility oracle paths, and
the first split-layout exporter maps native `oracle/` to `solution/`.
Publication-grade oracle equivalence enforcement remains target behavior.

## Assets, Provenance, And Evidence

Native task packages need more than source code. The standard should track
assets as first-class objects:

```yaml theme={null}
benchflow:
  provenance:
    images:
      - field: environment.docker_image
        reference: ghcr.io/org/task-image:2026-06
        digest: sha256:...
        registry: ghcr.io
        provider: modal
        build_source:
          path: environment/Dockerfile
          sha256: "<hash>"
        credential_secret_ref: task-image-pull
        never_persist_credentials: true
  assets:
    - path: environment/dataset.parquet
      visibility: agent
      sha256: "<hash>"
      source:
        url: https://huggingface.co/datasets/org/name
        revision: "<commit>"
      license: apache-2.0
      generated: false
      mount_phase: agent
    - path: verifier/hidden_cases.jsonl
      visibility: hidden_verifier
      sha256: "<hash>"
      mount_phase: verifier
  secrets:
    - name: task-image-pull
      scope: image-pull
      visibility: runtime_secret
      never_persist: true
```

Suggested visibility values:

* `agent`
* `hidden_verifier`
* `hidden_oracle`
* `runtime_secret`
* `external_dataset`
* `evidence_only`

Evidence should prove task validity, not just package shape:

```yaml theme={null}
benchflow:
  evidence:
    oracle_runs:
      required_reward: 1.0
      last_job: jobs/task-standard/oracle/...
      artifact: evidence/calibration/oracle-run.json
    verifier:
      reruns: 5
      flake_rate: 0.0
      report: evidence/calibration/verifier-stability-report.json
    review:
      anti_cheat: passed
      instruction_alignment: passed
      reviewer: benchflow
      artifact: evidence/calibration/review.json
    calibration:
      no_op_reward_max: 0.0
      known_bad_reward_max: 0.2
      partial_solution_range: [0.3, 0.8]
      report: evidence/calibration/calibration-report.json
      human_or_reference_examples:
        - name: gold
          expected_reward: 1.0
          artifact: evidence/calibration/gold-result.json
      judge_agreement:
        required: true
        sample_count: 5
        min_pairwise_agreement: 0.8
    trajectories:
      - path: trajectory/acp_trajectory.jsonl
        kind: acp
        visibility: evidence_only
        sha256: "<hash>"
      - path: trajectory/critique.jsonl
        kind: critique
        visibility: evidence_only
        sha256: "<hash>"
    artifacts:
      - path: evidence/calibration/oracle-run.json
        kind: oracle_run
        visibility: evidence_only
        sha256: "<hash>"
      - path: evidence/calibration/verifier-stability-report.json
        kind: verifier_stability
        visibility: evidence_only
        sha256: "<hash>"
      - path: evidence/calibration/review.json
        kind: acceptance_review
        visibility: evidence_only
        sha256: "<hash>"
      - path: evidence/calibration/calibration-report.json
        kind: calibration_report
        visibility: evidence_only
        sha256: "<hash>"
      - path: evidence/calibration/gold-result.json
        kind: reference_result
        visibility: evidence_only
        sha256: "<hash>"
      - path: artifacts/session.har
        kind: har
        mime_type: application/json
        produced_by: browser
        visibility: evidence_only
        sha256: "<hash>"
```

Borrow the discipline, not the package shape, from METR-style task standards:

* keep BenchFlow document-first rather than Python `TaskFamily`-first
* make asset visibility explicit instead of relying on directory folklore
* declare required secrets/resources with scope and "never persist" semantics
* record oracle/no-op/partial/human baseline scores where available
* keep protected/intermediate scoring behind explicit visibility controls

Valid evidence artifact kinds include `screenshot`, `video`, `har`,
`browser_trace`, `trajectory`, `critique`, `viewer_metadata`, and
`verifier_artifact`.

Oracle and calibration evidence is required for leaderboard-grade tasks and
recommended for all native tasks with hidden tests, subjective rubrics, or
LLM/agent-as-judge scoring.

## Interaction Model

An interaction declares one **mode**. The modes form the Interaction axis:

| Mode                     | Meaning                                                                                               | Runtime                                        |
| ------------------------ | ----------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| `single-shot`            | one agent, one prompt, scored once                                                                    | executable                                     |
| `multi-round`            | oracle access / progressive disclosure across rounds (`BaseUser`)                                     | executable                                     |
| `simulated-user`         | a declarative, model-backed user persona drives turns (not a live wire peer)                          | partial (linear; one active role per scene)    |
| `multi-agent-sequential` | roles hand off in one shared workspace                                                                | partial (sequential handoff only)              |
| `arena-concurrent`       | an assessor agent and the agent-under-test run as live A2A+MCP peers, scored *during* the interaction | target (runtime-deferred; see Open Primitives) |

`arena-concurrent` models a live interaction where the reward is produced *during*
the exchange (by an assessor agent) rather than by a post-hoc verifier. It is
declared in the schema so such tasks parse without loss; its runtime (an A2A bridge
for the agent-under-test plus a concurrently-running assessor) is deferred to a
later milestone (see *Open Primitives and Roadmap*). Note ACP (BenchFlow \<-> agent)
is not A2A (agent \<-> agent); arena needs the A2A leg.

The current executable slice is linear `scenes` that reference `agents.roles`.
The first team handoff subset is intentionally narrow: a document-declared user
loop may execute explicit multi-role scene turns sequentially when
`benchflow.teams.<team>.handoff` declares `mode: sequential`,
`workspace_visibility: shared`, and `trajectory_visibility: none|metadata`.

```yaml theme={null}
agents:
  roles:
    planner:
      agent: claude-agent-acp
      model: claude-sonnet-4-6
    implementer:
      agent: codex-acp
      model: gpt-5.5
      reasoning_effort: high
benchflow:
  teams:
    default:
      handoff:
        mode: sequential
        workspace_visibility: shared
        trajectory_visibility: metadata
```

Richer team semantics such as role membership enforcement, summaries, handoff
artifacts, parallel teams, branch routing, and full trajectory sharing are
parsed as draft surface but must fail closed until a runtime owns them.

Simulated users and nudges should be explicit about runtime type:

```yaml theme={null}
user:
  model: scripted
  stop_rule: satisfied-or-3-rounds
  private_facts:
    hidden_need: reveal only after the solver asks for it
benchflow:
  nudges:
    mode: simulated-user
    nudge_budget: 2
```

This deterministic subset compiles into `RolloutConfig.user` as a
`DocumentNudgeUser`: the public prompt runs first, private facts stay out of the
package metadata and initial solver prompt, and a fact is revealed only after a
targeted clarification question. Bounded model-linear simulated users should
stay equally explicit:

```yaml theme={null}
user:
  model: claude-haiku
  stop_rule: satisfied-or-5-rounds
benchflow:
  nudges:
    mode: simulated-user
    branchable: true
    branch_execution: option-kinds-preserved
    confirmation_policy:
      destructive_actions: human
```

If a runtime cannot compile `user` into a concrete loop, it should fail closed
or mark the field metadata-only rather than silently ignoring the user.
Today the first model-linear slice accepts `claude-*`, `gpt-*`, and
`gemini*`-style models for linear single- or multi-scene simulated users through
`ModelDocumentNudgeUser`. `confirmation_policy: human` installs a fail-closed
ACP permission handler unless the caller supplies an explicit `on_ask_user`
handler, and the ACP `ask_user` bridge preserves both option IDs and option
kinds so reject/allow choices are explicit branchable evidence. Authors may
spell the current executable branch slice as
`branch_execution: option-kinds-preserved`; `branch_execution: forked-snapshot`
fails closed until the user loop is integrated with the Environment snapshot
branch engine. The first sequential shared-workspace team handoff slice records
`scene`, `role`, `handoff_from`, and `handoff_to` metadata per user round.
`branchable` is still not automatic branch execution; interactive approval UI,
parallel teams, handoff artifacts, full trajectory sharing, and
branch/message-routing policy remain fail-closed target work.

## Compatibility

Compatibility must be explicit. A native package can guarantee export only for
the subset the target format supports.

```yaml theme={null}
benchflow:
  compatibility:
    harbor:
      export: full
      emits:
        config: task.toml
        prompt: instruction.md
        oracle: solution/
        verifier: tests/
    pier:
      export: degraded
      losses:
        - benchflow.teams
        - benchflow.nudges
```

Target compatibility rules:

1. Native authoring rejects unknown root config keys.

2. Foreign adapter import preserves unknown `task.toml` keys outside native
   config. The first implementation is `import_task_config_toml()`: strict
   native validation still fails on unknown keys, while compatibility import returns a
   validated `TaskConfig` plus `InboundCompatibility.config_extra`.

3. Legacy-to-native migration writes preserved foreign keys under
   `benchflow.compat.extra`:

   ```yaml theme={null}
   benchflow:
     compat:
       source: harbor
       extra_paths:
         - environment.modal.image
         - steps[0].runner
         - verifier.reward_kit.metric
       extra:
         environment:
           modal:
             image: registry.example.com/task:latest
         steps:
           - runner: harbor-step-runner
         verifier:
           reward_kit:
             metric: exact_match
   ```

4. Split export rehydrates `benchflow.compat.extra` back into `task.toml`
   without overwriting supported native keys. The export report records
   `restored_extension_paths` so compatibility is auditable.

5. `build_harbor_roundtrip_conformance_report()` proves the supported split
   surface across a split -> `task.md` -> split hop: canonical `TaskConfig`,
   normalized prompt, and environment/solution/tests file-map hashes.

6. Export emits a degraded-export report when the target format cannot express
   a native concept. The first implementation is
   `src/benchflow/task/export.py`, which writes a compatibility split layout
   plus `compatibility/export-report.json`.

7. Mixed native/legacy files are structurally invalid when task config,
   prompt, oracle/solution, or verifier/tests aliases drift.

8. Compatibility export should emit `task.toml`, `instruction.md`, `solution/`,
   and `tests/`.

9. BenchFlow import should prefer `task.md` only when compatibility metadata
   proves the legacy files are equivalent.

10. Export reports include selected definition, selected verifier/oracle dirs,
    input/output file hashes, alias collisions, restored foreign extensions,
    and any lost semantics.

11. `tests/` compatibility is path compatibility, not a separate native
    standard. For native packages, `verifier/` is authoritative. For split
    packages, `tests/` remains valid and may contain only `test.sh`.

12. Exporters map the selected native verifier tree to `tests/`; importers may
    preserve split-layout `tests/` without requiring `verifier.md` or rubric
    files. If both `verifier/` and `tests/` exist, validation should compare
    normalized file maps and fail closed on drift.

13. ORS/AgentBeats imports/exports live at the adapter boundary: ORS
    tool-output rewards become declared reward-event artifacts plus a terminal
    aggregate, and AgentBeats assessor agents become verifier strategies rather
    than root task syntax.

Round-trip guarantees are semantic, not byte-exact. Config equivalence should
be checked through canonical `TaskConfig` dumps; prompt equivalence through
normalized prompt text; verifier/oracle equivalence through deterministic
SHA-256 file maps over regular files. Comments, TOML/YAML formatting, and blank
line trivia are not preserved unless a same-format no-op export asks for that.

## Runtime Capability Matrix

Current implementation status:

| Feature                             |   Parse |    Runtime | Next gate                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                                          |
| ----------------------------------- | ------: | ---------: | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------- |
| `task.md` prompt                    |     yes |        yes | keep                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                          |
| native `verifier/`                  |     yes |        yes | keep                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                          |
| native `oracle/`                    |     yes |        yes | keep                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                          |
| verifier `script` strategy          |     yes |        yes | keep                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                          |
| verifier `llm-judge` strategy       |     yes |        yes | keep                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                          |
| verifier `reward-kit` strategy      |     yes |    partial | safe relative `reward.py` runner executes; declared criteria parse before launch, emit a runtime manifest, require exact metrics, and compute/verify canonical reward; fuller Reward Kit parity remains target work                                                                                                                                                                                                                                                                                                                                                                                 |                                          |
| verifier `agent-judge` strategy     |     yes |    partial | verifier-scoped LLM judge over declared inputs; richer ACP-backed judge agents remain target work                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                                          |
| verifier `ors-episode` strategy     |     yes |    partial | runtime helper writes ORS tool-output rewards to `trajectory/ors-rewards.jsonl`; declared reward responses/event streams normalize into `reward.json` and `reward-details.json`; fuller OpenReward environment import/export remains target work                                                                                                                                                                                                                                                                                                                                                    |                                          |
| `agents.roles`                      |     yes |    partial | `TaskRuntimeView` carries parsed scenes; explicit sequential shared-workspace handoff can switch roles through the user loop                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                          |
| `scenes`                            |     yes |    partial | prompt composition compiles; multi-role document-user scenes execute only with explicit turns and supported team handoff                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                          |
| `user` / `## user-persona`          |     yes |    partial | `model: scripted` + string `private_facts` compiles to `DocumentNudgeUser`; bounded model-linear users compile to `ModelDocumentNudgeUser`; linear single- and multi-scene user loops execute when every scene is single-role, or when explicit multi-role turns opt into sequential shared-workspace team handoff; `confirmation_policy: human` gates ACP permissions fail-closed without an explicit handler; `branch_execution: option-kinds-preserved` preserves option IDs and kinds; forked branch execution, interactive approval UI, parallel teams, and rich handoff artifacts fail closed |                                          |
| `benchflow.teams`                   |     yes |    partial | supports exactly one `handoff` with `mode: sequential`, `workspace_visibility: shared`, and \`trajectory\_visibility: none                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | metadata\`; richer team keys fail closed |
| `benchflow:`                        |     raw |         no | typed document schema after v0.3 stabilizes                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                          |
| imported `steps`                    |     yes | no/partial | fail closed per sandbox until implemented                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                                          |
| root/step artifacts                 |     yes | no/partial | implement collection or fail closed                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                          |
| network allowlist                   |     yes | no/partial | per-sandbox capability check                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                          |
| separate verifier env               |     yes | no/partial | materializer plus verifier runner support                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                                          |
| Windows / TPU                       |     yes |         no | fail closed                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                          |
| healthcheck                         |     yes | no/partial | fail closed until sandbox healthcheck support lands                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                          |
| workdir                             |     yes |    partial | absolute non-root paths are materialized; root/relative paths fail closed                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                                          |
| `reward.json` precedence            |     yes |    partial | prefer JSON when present and reject both-present mismatches                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                          |
| metrics aggregate policy            |     yes |    partial | `mean`, `weighted_mean`, and `weighted_sum`; richer engines remain target work                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |                                          |
| `arena-concurrent` interaction (G4) |      no |         no | add interaction-mode schema now; A2A bridge for the agent-under-test + concurrently-running assessor at M2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                                          |
| hybrid reward envelope (G1)         | partial |         no | declared cross-surface product/sum of factors; M1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                                          |
| GAIN aggregation (G2)               |      no |         no | dynamic live baseline + ceiling; M1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                          |
| leaderboard-submission (G5)         | partial |         no | hosted / hidden external scorer with durable result record; M1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |                                          |
| RL step-reward (G6)                 |      no |         no | per-action environment reaction; M1/M2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |                                          |

Today `bench tasks check` is structural by default. `bench tasks check --level
schema` checks only the task authoring entrypoint and prompt parse, so
schema-only fixtures can be validated without pretending to be runnable task
packages. `bench tasks check --sandbox <backend>` runs the sandbox-aware
capability gate for this matrix, and `bench tasks check --level
runtime-capability --sandbox <backend>` names that gate explicitly.
`bench tasks check --level publication-grade` adds the first static native
publication gate: the package must use `task.md`, native `oracle/`, native
`verifier/verifier.md`, rubric files, and selected verifier strategy artifacts
with an explicit `reward_json` output contract. Unknown sandbox names fail
runtime-capability validation instead of becoming no-op checks. `bench tasks
check --level acceptance` adds the first static evidence
gate: `benchflow.evidence` must declare oracle proof, verifier reruns and
flake rate, a verifier stability report with concrete run records, anti-cheat
and instruction-alignment review status, calibration bounds, reference
artifacts, a calibration report with no-op/known-bad/partial/reference cases,
and SHA-256 pins for every primary evidence file. The static gate parses the
declared JSON artifacts and checks that oracle rewards, review status,
calibration examples, calibration report cases, and verifier run records agree
with the declared metadata. `bench tasks check --level acceptance-live --sandbox <backend>` now adds the first executable live-evidence slice:
`benchflow.evidence.acceptance_live.cases` can declare fresh verifier cases
and executable `oracle/solve.sh` reruns that pass through the selected sandbox
and BenchFlow verifier boundary, with reward thresholds checked from live
`reward.txt` / `reward.json` output. `acceptance_live.calibration.from:
calibration.report` can also generate live no-op/known-bad/partial/reference
cases from the static calibration report; generated low/partial cases require
single-line sandbox commands in the report so the checker runs real
perturbations instead of treating missing artifacts as expected verifier
errors. When `acceptance_live.report` is declared, the runner writes a
package-local `acceptance-live-report` JSON artifact plus a `.sha256` sidecar
with run records, reward summaries, task/spec hashes, generated/declared case
sources, staged workspace hash, and secret-safe diagnostic fields such as
`verifier_error_category`, `diagnostic_code`, and `artifact_hint`. Dependency
install flakes point to `verifier/test-stdout.txt` instead of embedding raw
resolver output in the report. For dogfood against checked-in examples that must
not dirty the package, `bench tasks check --level acceptance-live --report-output <path>` writes the report and sidecar to a host path instead of
the package-local path, and `--no-report-write` skips writing the report and
sidecar entirely (validation only); the package-local write is reserved for
intentionally refreshing checked-in evidence. Repeated live cases can declare
`expect.flake_rate_max` to enforce observed flake rate across fresh sandbox
reruns; without that field, each failed rerun fails the gate. Larger repeated
flake campaigns, model/submission metadata, hosted leaderboard publication, and
leaderboard export are still target work. If
`acceptance_live.leaderboard.required: true` is declared, the live report
includes a local `leaderboard_suitability` verdict requiring live runs, all
runs passing, an observed flake rate within
`acceptance_live.leaderboard.max_flake_rate`, oracle/reference proof, and
generated calibration coverage for no-op, known-bad, partial, and reference
cases. A parsed
field that the selected runtime cannot honor is worse than a parse error.

## Architecture Slices

P0: Add `TaskPackage` / `TaskRuntimeView`.

The first runtime-facing slices exist in `src/benchflow/task/runtime_view.py`
and `src/benchflow/task/package.py`. They answer:

* which entrypoint is authoritative
* what prompt goes into `/instruction.md` compatibility materialization
* which verifier/oracle directories are native versus legacy
* which scenes were parsed for execution
* which source hashes and compatibility metadata apply
* which verifier document and selected strategy apply
* which sandbox runtime issues block launch
* which compatibility export report describes target-format loss
* which prompt plan composes base, role, scene, and turn prompts for rollout,
  with redacted document-declared user metadata

The remaining package-boundary work is richer adapter import/export state,
acceptance/calibration validation levels, and freezing selected verifier/user
runtime semantics through launch instead of reparsing at every edge.

P1: Add fail-closed capability checks.

Rollout, verifier, hardening, and adapters should stop parsing fields they do
not execute without surfacing a validation result.

This includes explicit gates for `task.md` plus split-file drift,
`verifier/` plus `tests/`, and `oracle/` plus `solution/`.

The first module is `src/benchflow/task/runtime_capabilities.py` with a pure
validator:

```python theme={null}
validate_task_runtime_support(task, *, sandbox, task_dir) -> list[UnsupportedTaskFeature]
```

It reports stable config paths and reasons for unsupported `steps`, root/step
`artifacts`, allowlists, separate verifier environments, Windows, TPU,
healthchecks, unsafe workdirs, document-only `user`/`benchflow` runtime
semantics, and non-`main` verifier services on backends that cannot run them.
It is wired into `bench tasks check --sandbox <backend>` and the shared sandbox
factory used by rollouts and `Environment.from_task()`. Unsupported parsed
semantics now raise `UnsupportedTaskFeatureError` before Docker, Daytona, or
Modal construction. Safe absolute non-root `environment.workdir` values are
materialized before agent and verifier setup.

P2: Split native and adapter validation modes.

Native authoring should be strict. Foreign import should preserve and warn.

P3: Type the `benchflow:` namespace.

Start with `document_version`, `compatibility`, `provenance`, `assets`,
`secrets`, `evidence`, `teams`, `nudges`, `prompt`, `agent_policy`, and
`runtime_policy`.

P4: Extend exporters.

The first `bench tasks export` path exports to a compatibility split layout
with explicit loss reports, backed by `export_task_to_split_layout()`. Extend
the same reporting discipline to external benchmark datasets and same-format
no-op exports.
