BenchFlow Task Package Standard
Status: v0.6.2, current stable task package standard (2026-06-14)

This document defines the direction for BenchFlow-native task packages. The
short version: task.md is the native authoring entrypoint; oracle/ and
verifier/ are the BenchFlow-native names for held-out reference behavior and
reward checks. Split-layout names such as solution/ and tests/ remain
compatibility names for migration and export only.
The standard is intentionally split into three views:
| View | Purpose | Owner |
|---|---|---|
| Authoring document | What humans and generators write: one task.md plus sidecar dirs | TaskDocument |
| Runtime task view | What rollout, verifier, hardening, provenance, and trajectories consume | TaskRuntimeView plus first TaskPackage boundary |
| Foreign adapter view | What external benchmark formats and hosted environments import/export | adapters |
Goals and scope
A BenchFlow task is one task.md that selects a mode on each of three planes
(see Planes below): how the environment is built, how the agent interacts, and
how the result is scored. The standard has three goals:
- Native authoring: humans and generators write one task.md (plus sidecar dirs) as the primary surface.
- Interoperability: existing split-layout formats (task.toml + instruction.md + solution/ + tests/) import directly, and packages export back to that layout with an explicit, honest loss report when a native concept has no equivalent.
- Coverage: the schema can express the full range of eval shapes (single-shot, multi-round, simulated-user, multi-agent, and live-arena interaction; workspace, trajectory, rubric, judge, and leaderboard reward), even where the runtime for a given mode lands in a later milestone (see Open Primitives and Roadmap).
Field discipline
Every normative standard field must map to a mode on one of the three planes (below) or to a concrete BenchFlow runtime need. Fields without that mapping belong under metadata or in an adapter, not in the standard.
Native Layout
| Native | Compatibility / foreign export | Meaning |
|---|---|---|
| task.md | task.toml + instruction.md | Task config plus prompt material |
| oracle/ | solution/ | Held-out reference implementation |
| verifier/ | tests/ | Verifier package, reward code, hidden checks, rubrics, and judges |
- if task.md exists, it is the authoritative task definition
- if verifier/ exists, it is the authoritative verifier directory; an empty or invalid verifier/ does not fall back to tests/
- if oracle/ exists, it is the authoritative oracle directory; an empty or invalid oracle/ does not fall back to solution/
- duplicate alias trees must be byte-identical after normalized traversal or validation should report a collision
task.md as
the only authoritative entrypoint.
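
The no-fallback rule for native directories can be sketched in a few lines. This is an illustrative sketch only: `select_dir` is a hypothetical helper, not a BenchFlow API, and it models just the "native wins by existing" part of the rules above.

```python
from pathlib import Path
from typing import Optional


def select_dir(package_root: Path, native: str, compat: str) -> Optional[Path]:
    """Pick the authoritative directory for one native/compat name pair.

    Hypothetical helper. If the native directory exists it wins outright,
    even when it is empty or invalid, so there is never a fallback to the
    compatibility name once the native one is present.
    """
    native_dir = package_root / native
    if native_dir.is_dir():
        return native_dir
    compat_dir = package_root / compat
    return compat_dir if compat_dir.is_dir() else None
```

A caller would use `select_dir(root, "verifier", "tests")` and `select_dir(root, "oracle", "solution")`; checking that an existing native directory is actually valid is a separate validation step.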
Versioning
There are two versions:
- schema_version is for the runtime config model shared with compatibility imports.
- benchflow.document_version is for document-only concepts such as teams, prompt composition, agent policy, runtime policy, private assets, provenance, evidence, nudges, and export policy.
Root Frontmatter
The root frontmatter has three classes of keys. BenchFlow config keys are modeled by TaskConfig and must be rejected
when unknown in native authoring mode:
- schema_version / version
- task
- metadata
- agent
- verifier
- environment
- oracle (validation alias: solution)
- source
- artifacts
- steps
- multi_step_reward_strategy

TaskDocument:
- agents
- scenes
- user

benchflow
benchflow: until they have a stable interface.
Native authoring should use oracle. Importers may accept the compatibility
solution alias, but a config that contains both names is invalid.
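
As a rough illustration of strict native validation, the sketch below rejects unknown root keys and the oracle/solution double declaration. The key sets are copied from this section; `check_root_keys` and the message strings are hypothetical and do not reflect the real TaskConfig validator.

```python
# Root keys named in this section (assumed complete for the sketch).
CONFIG_KEYS = {
    "schema_version", "version", "task", "metadata", "agent", "verifier",
    "environment", "oracle", "solution", "source", "artifacts", "steps",
    "multi_step_reward_strategy",
}
DOCUMENT_KEYS = {"agents", "scenes", "user"}
EXTENSION_KEYS = {"benchflow"}


def check_root_keys(frontmatter: dict) -> list[str]:
    """Return problems found in native authoring mode (empty list means OK)."""
    known = CONFIG_KEYS | DOCUMENT_KEYS | EXTENSION_KEYS
    problems = [
        f"unknown root key: {key}"
        for key in sorted(frontmatter.keys() - known)
    ]
    # The compatibility alias is tolerated alone, never next to the native name.
    if "oracle" in frontmatter and "solution" in frontmatter:
        problems.append("both 'oracle' and 'solution' are set; use 'oracle'")
    return problems
```

An adapter import path would preserve the unknown keys instead of reporting them, as described under Compatibility.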
Prompt Body
The task.md body is the base prompt — free-form markdown, exactly like a
SKILL.md body. No ## prompt heading is required: if the body carries no
reserved section headings, the entire body is the prompt. The common single-shot
task is just frontmatter plus prose, so a bespoke benchmark ports by dropping its
existing instruction text in as the body with no markup.
Tasks that need more than a base prompt — multiple roles, multiple scenes, or a
simulated-user persona — author each as its own free-form file under
prompts/, so no body ever carries reserved-heading ceremony:
| File | Meaning |
|---|---|
| prompts/role.<name>.md | Role prompt |
| prompts/scene.<name>.md | Scene prompt |
| prompts/user-persona.md | Simulated user persona |
## prompt, ## role:<name>, ## scene:<name>, and ## user-persona
headings; these import losslessly and normalize to the file layout. Sidecar files
take precedence over a heading of the same name. New tasks should prefer the files.
Default runtime precedence remains a simple fallback:
- inline turn prompt
- scene prompt
- role prompt
- base prompt
TaskPackage prompt plan now compiles append and explicit replace
policies deterministically. That avoids losing role guardrails when a scene
prompt is present. RolloutConfig consumes the compiled plan for task.md scene
execution when explicit CLI/SDK prompts do not override the document.
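
The default fallback order listed above (inline turn, then scene, then role, then base) can be written as a small function. This is a minimal sketch with a hypothetical `resolve_prompt` helper; it models only the simple fallback, not the compiled append/replace prompt plan, and it assumes an empty string counts as absent.

```python
from typing import Optional


def resolve_prompt(base: str,
                   role: Optional[str] = None,
                   scene: Optional[str] = None,
                   turn: Optional[str] = None) -> str:
    """Return the most specific prompt that is present."""
    for candidate in (turn, scene, role):
        if candidate:  # assumption: None and "" both mean "not provided"
            return candidate
    return base
```

Note that a pure fallback drops the role prompt whenever a scene prompt exists, which is exactly the guardrail loss the compiled prompt plan is meant to avoid.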
Agent And Runtime Policy
Agent isolation is distinct from task environment networking. A no-search task can still need dependency, LLM, or provider egress outside the sandbox; a no-network task can still need the agent harness to call its model.

agent_policy, runtime_policy, private mounts, registry secrets,
GPU types, phase-specific network overrides, or persistent state semantics must
fail closed before launch for the selected sandbox. No-search proof must come
from both launch policy and trajectory audit; it is not implied by
network_mode: no-network.
Planes
The package has six planes. Keeping them separate is the core abstraction.
| Plane | Owns | Should not own |
|---|---|---|
| Package | identity, versions, provenance, source hashes, export policy | sandbox execution |
| Runtime | environment image, resources, network policy, setup/reset/readiness, mounts | prompt orchestration |
| Interaction | agents, roles, teams, scenes, turns, user/nudge loops, handoff policy | verifier checkpoints |
| Verifier | verifier package entrypoint, reward file contract, hidden fixtures, scorer type, rubrics, judge roles, separate verifier env | agent prompt text |
| Oracle | held-out reference implementation and oracle-only env | tests/verifier code |
| Evidence | validation runs, flake data, anti-cheat review, artifact hashes, leaderboard metadata | mutable task source |
steps are verifier/runtime checkpoints. BenchFlow scenes are
interaction checkpoints. They are orthogonal. A task can have both, but a
runtime must define how they compose before executing them.
Of the six planes, three are the authoring axes an author selects a mode from
— Runtime (the environment), Interaction, and Verifier — while the remaining three
(Package, Oracle, Evidence) are supporting planes. A task is one mode-selection
per axis: environment × interaction-mode × verifier-strategy. New normative
fields must add or compose a mode on one of these axes; otherwise they belong in
metadata or an adapter.
Presets And Profiles
Two different reuse mechanisms were historically both called “profile,” which is why the mental model felt overloaded. v0.6 separates them:
- preset: authoring sugar. A named bundle of defaults that bench tasks normalize expands into the canonical contract and then discards; it has no runtime existence. Presets are the lightweight-authoring path: a tiny task.md becomes a full contract. Examples: code-change, acceptance-live, harbor-compatible. (The current implementation still spells the preset key as profile:; M0 renames it to preset: and keeps profile: as a deprecated alias.)
- *-profile: runtime-real shared config. A benchmark-level object that many tasks reference and that persists into TaskRuntimeView, one per authoring axis:
  - environment-profile: a shared world reused across tasks (e.g. one clawsbench service catalog referenced by all 44 tasks)
  - agent-profile: shared harness / model / role wiring
  - verifier-profile: shared reward strategy and rubric
Rule of thumb: if removing it does not change what runs, it is a preset;
if removing it changes what runs, it is a *-profile. Presets normalize away;
profiles are inherited. *-profile resolution is M1 runtime work; preset exists
today.
Verifier Package
The verifier is a peer package, not just a tests/ directory. Native verifier
packages should have their own entry document:
verifier/verifier.md is analogous to task.md for the evaluation side. It
describes what evidence is read, which strategies can score the task, how
rubric dimensions compose, which judge agents are allowed, and what outputs the
runtime must preserve.
verifier/task.toml is an optional compatibility projection of
verifier/verifier.md; it is not a second native surface. If both files exist,
their canonical projection must match or validation fails closed.
test.sh is the minimum executable strategy and the compatibility export target.
Reward Kit-style criteria, ORS-style episode rewards, and AgentBeats-style
assessor agents are verifier strategies. Runtime adapters may write declared
evidence artifacts such as trajectory/ors-rewards.jsonl, but the ORS-specific
judge/normalization semantics stay inside verifier scope. They must not leak
into agent prompts, and they must emit either the canonical reward envelope or a
declared multi-metric map.
Verifier packages must be isolated:
- judge models and credentials are verifier-scoped, not agent-visible
- judge prompts live under verifier/, not inside the task prompt
- rubrics are separate from judge persona: rubrics/verifier.md is the human scoring contract, rubrics/verifier.toml or JSON is the structured scoring contract, and ## role:verifier_judge is the assessor posture
- hidden fixtures and rubrics are mounted only during verifier execution
- agent-as-judge trajectories are recorded as evidence and audited for input leakage
- unsupported verifier strategies fail closed before scoring starts
Verifier Contract
Native verifier code lives in verifier/ and is mounted at /verifier.
Compatibility tests/ remains supported and is mounted at /tests.
The minimum script verifier contract applies to the selected verifier
entrypoint: native packages with no verifier/verifier.md run
verifier/test.sh; imported split-layout packages may run tests/test.sh.
When verifier/verifier.md is present, structural verifier validity follows
the selected strategy. A package can therefore be valid without
verifier/test.sh if, for example, the selected Reward Kit runner and criteria
files exist.
- write /logs/verifier/reward.txt with one float from 0.0 to 1.0
- optionally write /logs/verifier/reward.json with structured rubric/evidence; BenchFlow preserves structured rewards, keeps reward.txt as scalar compatibility, and can compute a scalar from declared aggregate policies
- prefer exit 0 after writing reward
- treat nonzero verifier exit without a fresh reward file as infrastructure failure; a nonzero exit with a fresh reward is a scored task result with verifier diagnostics
tests/test.patch, Windows entrypoints, artifact-only graders,
separate verifier images, and hidden fixtures without overloading the directory
name.
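
The exit-code and reward-file rules of the minimum script contract can be sketched as a classifier. Everything here is illustrative: `classify_verifier_run` and its status labels are hypothetical, the caller is assumed to pass the contents of a fresh reward.txt (or None when no fresh file exists), and treating an out-of-range or unparsable value as its own category is an assumption the text does not spell out.

```python
from typing import Optional, Tuple


def classify_verifier_run(exit_code: int,
                          fresh_reward: Optional[str]) -> Tuple[str, Optional[float]]:
    """Classify one verifier run from its exit code and fresh reward.txt text."""
    if fresh_reward is None:
        # No fresh reward file: nothing to score, so this is not a task result.
        return ("infrastructure_failure", None)
    try:
        value = float(fresh_reward.strip())
    except ValueError:
        return ("invalid_reward", None)
    if not 0.0 <= value <= 1.0:  # also rejects NaN
        return ("invalid_reward", None)
    # A nonzero exit with a fresh reward is still a scored result.
    status = "scored" if exit_code == 0 else "scored_with_diagnostics"
    return (status, value)
```

Determining whether a reward file is "fresh" (for example by clearing stale copies before the verifier starts) is left to the caller in this sketch.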
Separate verifier execution must define image resolution, hidden fixture
mounting, step-level verifier environments, and transfer rules. Hidden verifier
inputs such as tests/test.patch or verifier/test.patch are mounted only for
the verifier phase; agent logs and workspace files move into verifier scope
only through declared artifact transfer paths.
Target reward precedence:
- reward.txt remains the scalar compatibility minimum
- when reward.json exists, it should be the authoritative rich reward artifact
- reward.json may be an envelope with a numeric reward, or a reward-kit style multi-metric map with metrics plus a declared aggregate policy
- if both files exist and reward.json has a scalar aggregate, the scalar must match float(reward.txt) or validation should fail closed
- if reward.json is a multi-metric map without a scalar reward, verifier metadata may declare the aggregate policy used for scalar exports; current runtime computes reward for mean, weighted_mean, and weighted_sum, and reward.txt may still carry the scalar compatibility value
- reward-details.json should be preserved when present
- reward artifacts should preserve structured reserved keys such as rubric, items, evidence, artifacts, metadata, reason, reasons, errors, and task-specific payloads such as metrics, regressions, participants, winner, raw, and debug
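
A sketch of the precedence rules above, under stated assumptions: `resolve_reward` and `aggregate` are hypothetical names, the JSON shape (`reward`, `metrics`, `aggregate` as a policy name, optional `weights`) is assumed for illustration, and a missing weight defaults to 1.0. It is not the BenchFlow runtime implementation.

```python
from typing import Optional


def aggregate(metrics: dict, policy: str, weights: Optional[dict] = None) -> float:
    """Collapse a metric map to a scalar with one of the three named policies."""
    if not metrics:
        raise ValueError("cannot aggregate an empty metric map")
    if policy == "mean":
        return sum(metrics.values()) / len(metrics)
    weights = weights or {}
    weighted = sum(value * weights.get(name, 1.0) for name, value in metrics.items())
    if policy == "weighted_sum":
        return weighted
    if policy == "weighted_mean":
        return weighted / sum(weights.get(name, 1.0) for name in metrics)
    raise ValueError(f"unsupported aggregate policy: {policy}")


def resolve_reward(reward_txt: Optional[str], reward_json: Optional[dict],
                   tolerance: float = 1e-9) -> float:
    """Prefer reward.json; fail closed if its scalar disagrees with reward.txt."""
    if reward_json is None:
        if reward_txt is None:
            raise ValueError("no reward artifact was written")
        return float(reward_txt)
    if "reward" in reward_json:
        scalar = float(reward_json["reward"])
        if reward_txt is not None and abs(scalar - float(reward_txt)) > tolerance:
            raise ValueError("reward.json and reward.txt disagree")
        return scalar
    # Metrics-only map: compute the scalar from the declared policy.
    return aggregate(reward_json["metrics"], reward_json["aggregate"],
                     reward_json.get("weights"))
```

For example, `resolve_reward("0.5", {"reward": 0.5})` returns 0.5, while `resolve_reward("0.9", {"reward": 0.5})` raises instead of silently picking one file.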
reward.json over reward.txt, rejects
disagreeing scalar outputs, and preserves a first set of structured reward
fields. src/benchflow/task/verifier_document.py parses
verifier/verifier.md strategies, rubric metadata, output contracts, and
verifier-scoped role prompts. When verifier/verifier.md is present,
Verifier.verify() selects its default strategy: script runs the declared
command relative to the uploaded verifier directory, llm-judge uses the
existing deliverables judge with verifier-local rubric, model, input directory,
and context/context-file overrides, reward-kit runs a safe relative
reward.py package runner inside verifier scope, writes a
reward-kit-manifest.json contract, and, when criteria are declared, parses
those criteria before launch and computes/verifies canonical reward from
matching reward.json.metrics. agent-judge runs a
verifier-scoped judge role over declared evidence inputs. ORS runtime helpers
can normalize tool-output rewards into trajectory/ors-rewards.jsonl;
ors-episode then reads declared ORS reward evidence, normalizes reward
responses or event streams through the existing ORS adapter, and emits
canonical reward.json plus reward-details.json.
reward-details.json is a named rollout artifact: stale copies are cleared
before verification, script verifiers can preserve it, target-service
verifiers download it with the rest of /logs/verifier, and the built-in LLM
judge emits criterion details there. Metrics-only reward.json maps can now
use the verifier document’s outputs.aggregate_policy or selected Reward Kit
criteria policy to compute and persist the canonical reward. The runtime still
does not fully match the target: full Reward Kit parity, full OpenReward
environment import/export, and AgentBeats assessor lifecycles are not yet
first-class; selected unsupported strategies fail closed.
Validation evidence should include parser checks, legacy-vs-native migration
parity, live rollout artifacts, negative-contract failures, and explicit
fail-closed results for parsed fields that the selected runtime cannot honor.
Oracle Contract
Native oracle code lives in oracle/ and is mounted at /oracle.
Compatibility solution/ remains supported and is mounted at /solution.
The naming change is semantic:
- oracle means “held-out reference behavior used to prove solvability”
- solution is a compatibility export name for older split layouts
oracle/ to foreign
solution/. Current runtime supports native and compatibility oracle paths, and
the first split-layout exporter maps native oracle/ to solution/.
Publication-grade oracle equivalence enforcement remains target behavior.
Assets, Provenance, And Evidence
Native task packages need more than source code. The standard should track assets as first-class objects: agent, hidden_verifier, hidden_oracle, runtime_secret, external_dataset, and evidence_only.
- keep BenchFlow document-first rather than Python TaskFamily-first
- make asset visibility explicit instead of relying on directory folklore
- declare required secrets/resources with scope and “never persist” semantics
- record oracle/no-op/partial/human baseline scores where available
- keep protected/intermediate scoring behind explicit visibility controls
screenshot, video, har,
browser_trace, trajectory, critique, viewer_metadata, and
verifier_artifact.
Oracle and calibration evidence is required for leaderboard-grade tasks and
recommended for all native tasks with hidden tests, subjective rubrics, or
LLM/agent-as-judge scoring.
Interaction Model
An interaction declares one mode. The modes form the Interaction axis:
| Mode | Meaning | Runtime |
|---|---|---|
| single-shot | one agent, one prompt, scored once | executable |
| multi-round | oracle access / progressive disclosure across rounds (BaseUser) | executable |
| simulated-user | a declarative, model-backed user persona drives turns (not a live wire peer) | partial (linear; one active role per scene) |
| multi-agent-sequential | roles hand off in one shared workspace | partial (sequential handoff only) |
| arena-concurrent | an assessor agent and the agent-under-test run as live A2A+MCP peers, scored during the interaction | target (runtime-deferred; see Open Primitives) |
arena-concurrent models a live interaction where the reward is produced during
the exchange (by an assessor agent) rather than by a post-hoc verifier. It is
declared in the schema so such tasks parse without loss; its runtime (an A2A bridge
for the agent-under-test plus a concurrently-running assessor) is deferred to a
later milestone (see Open Primitives and Roadmap). Note that ACP (BenchFlow <-> agent)
is not A2A (agent <-> agent); arena needs the A2A leg.
The current executable slice is linear scenes that reference agents.roles.
The first team handoff subset is intentionally narrow: a document-declared user
loop may execute explicit multi-role scene turns sequentially when
benchflow.teams.<team>.handoff declares mode: sequential,
workspace_visibility: shared, and trajectory_visibility: none|metadata.
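
The narrow handoff subset described above lends itself to an allow-list check. The sketch below is hypothetical (`handoff_issues` is not a BenchFlow function) and assumes the handoff block has already been parsed into a plain dict; any key or value outside the three supported settings is reported so the caller can fail closed.

```python
# Supported values for the first sequential handoff slice, per the text above.
SUPPORTED_HANDOFF = {
    "mode": {"sequential"},
    "workspace_visibility": {"shared"},
    "trajectory_visibility": {"none", "metadata"},
}


def handoff_issues(handoff: dict) -> list[str]:
    """Return reasons this handoff block cannot run (empty list means runnable)."""
    issues = []
    for key, allowed in SUPPORTED_HANDOFF.items():
        value = handoff.get(key)
        if value not in allowed:
            issues.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    # Richer keys are not silently ignored; they block execution.
    for key in sorted(handoff.keys() - SUPPORTED_HANDOFF.keys()):
        issues.append(f"unsupported handoff key: {key}")
    return issues
```

Returning a list rather than raising keeps the check pure, so the same function could back both a static check and a pre-launch gate.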
RolloutConfig.user as a
DocumentNudgeUser: the public prompt runs first, private facts stay out of the
package metadata and initial solver prompt, and a fact is revealed only after a
targeted clarification question. Bounded model-linear simulated users should
stay equally explicit:
user into a concrete loop, it should fail closed
or mark the field metadata-only rather than silently ignoring the user.
Today the first model-linear slice accepts claude-*, gpt-*, and
gemini*-style models for linear single- or multi-scene simulated users through
ModelDocumentNudgeUser. confirmation_policy: human installs a fail-closed
ACP permission handler unless the caller supplies an explicit on_ask_user
handler, and the ACP ask_user bridge preserves both option IDs and option
kinds so reject/allow choices are explicit branchable evidence. Authors may
spell the current executable branch slice as
branch_execution: option-kinds-preserved; branch_execution: forked-snapshot
fails closed until the user loop is integrated with the Environment snapshot
branch engine. The first sequential shared-workspace team handoff slice records
scene, role, handoff_from, and handoff_to metadata per user round.
branchable is still not automatic branch execution; interactive approval UI,
parallel teams, handoff artifacts, full trajectory sharing, and
branch/message-routing policy remain fail-closed target work.
Compatibility
Compatibility must be explicit. A native package can guarantee export only for the subset the target format supports.
- Native authoring rejects unknown root config keys.
- Foreign adapter import preserves unknown task.toml keys outside native config. The first implementation is import_task_config_toml(): strict native validation still fails on unknown keys, while compatibility import returns a validated TaskConfig plus InboundCompatibility.config_extra.
- Legacy-to-native migration writes preserved foreign keys under benchflow.compat.extra.
- Split export rehydrates benchflow.compat.extra back into task.toml without overwriting supported native keys. The export report records restored_extension_paths so compatibility is auditable.
- build_harbor_roundtrip_conformance_report() proves the supported split surface across a split -> task.md -> split hop: canonical TaskConfig, normalized prompt, and environment/solution/tests file-map hashes.
- Export emits a degraded-export report when the target format cannot express a native concept. The first implementation is src/benchflow/task/export.py, which writes a compatibility split layout plus compatibility/export-report.json.
- Mixed native/legacy files are structurally invalid when task config, prompt, oracle/solution, or verifier/tests aliases drift.
- Compatibility export should emit task.toml, instruction.md, solution/, and tests/.
- BenchFlow import should prefer task.md only when compatibility metadata proves the legacy files are equivalent.
- Export reports include selected definition, selected verifier/oracle dirs, input/output file hashes, alias collisions, restored foreign extensions, and any lost semantics.
- tests/ compatibility is path compatibility, not a separate native standard. For native packages, verifier/ is authoritative. For split packages, tests/ remains valid and may contain only test.sh.
- Exporters map the selected native verifier tree to tests/; importers may preserve split-layout tests/ without requiring verifier.md or rubric files. If both verifier/ and tests/ exist, validation should compare normalized file maps and fail closed on drift.
- ORS/AgentBeats imports/exports live at the adapter boundary: ORS tool-output rewards become declared reward-event artifacts plus a terminal aggregate, and AgentBeats assessor agents become verifier strategies rather than root task syntax.
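
The "rehydrate without overwriting" rule for preserved foreign keys can be illustrated with a recursive merge. This is a sketch under assumptions: `rehydrate_extra` is a hypothetical helper, configs are plain nested dicts, and restored paths are reported as dotted strings; the real exporter and its report format may differ.

```python
def rehydrate_extra(native: dict, extra: dict, prefix: str = ""):
    """Merge preserved foreign keys into a config without overwriting native keys.

    Returns (merged_config, restored_paths). A key already present in
    `native` always wins; only missing keys are restored from `extra`.
    """
    merged = dict(native)
    restored = []
    for key, value in extra.items():
        path = f"{prefix}{key}"
        if key not in merged:
            merged[key] = value
            restored.append(path)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key], nested = rehydrate_extra(merged[key], value, f"{path}.")
            restored.extend(nested)
        # Otherwise a supported native key already holds a value: keep it.
    return merged, restored
```

Recording the restored paths alongside the merged config is what makes the export auditable: a reviewer can see exactly which foreign keys came back and confirm none replaced a native value.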
TaskConfig dumps; prompt equivalence through
normalized prompt text; verifier/oracle equivalence through deterministic
SHA-256 file maps over regular files. Comments, TOML/YAML formatting, and blank
line trivia are not preserved unless a same-format no-op export asks for that.
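
A deterministic SHA-256 file map over regular files might look like the sketch below. `file_map` and `trees_match` are hypothetical names; the normalization chosen here (relative POSIX paths in sorted order, symlinks skipped) is an assumption about what "normalized traversal" means.

```python
import hashlib
from pathlib import Path


def file_map(root: Path) -> dict[str, str]:
    """Map each regular file under root to the SHA-256 hex digest of its bytes."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and not path.is_symlink():
            relative = path.relative_to(root).as_posix()
            result[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def trees_match(first: Path, second: Path) -> bool:
    """True when both trees hold the same relative paths with identical bytes."""
    return file_map(first) == file_map(second)
```

Comparing `file_map(pkg / "verifier")` with `file_map(pkg / "tests")` would detect alias drift between a native tree and its compatibility twin, since directory names above the root do not enter the map.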
Runtime Capability Matrix
Current implementation status:
| Feature | Parse | Runtime | Next gate |
|---|---|---|---|
| task.md prompt | yes | yes | keep |
| native verifier/ | yes | yes | keep |
| native oracle/ | yes | yes | keep |
| verifier script strategy | yes | yes | keep |
| verifier llm-judge strategy | yes | yes | keep |
| verifier reward-kit strategy | yes | partial | safe relative reward.py runner executes; declared criteria parse before launch, emit a runtime manifest, require exact metrics, and compute/verify canonical reward; fuller Reward Kit parity remains target work |
| verifier agent-judge strategy | yes | partial | verifier-scoped LLM judge over declared inputs; richer ACP-backed judge agents remain target work |
| verifier ors-episode strategy | yes | partial | runtime helper writes ORS tool-output rewards to trajectory/ors-rewards.jsonl; declared reward responses/event streams normalize into reward.json and reward-details.json; fuller OpenReward environment import/export remains target work |
| agents.roles | yes | partial | TaskRuntimeView carries parsed scenes; explicit sequential shared-workspace handoff can switch roles through the user loop |
| scenes | yes | partial | prompt composition compiles; multi-role document-user scenes execute only with explicit turns and supported team handoff |
| user / ## user-persona | yes | partial | model: scripted + string private_facts compiles to DocumentNudgeUser; bounded model-linear users compile to ModelDocumentNudgeUser; linear single- and multi-scene user loops execute when every scene is single-role, or when explicit multi-role turns opt into sequential shared-workspace team handoff; confirmation_policy: human gates ACP permissions fail-closed without an explicit handler; branch_execution: option-kinds-preserved preserves option IDs and kinds; forked branch execution, interactive approval UI, parallel teams, and rich handoff artifacts fail closed |
| benchflow.teams | yes | partial | supports exactly one handoff with mode: sequential, workspace_visibility: shared, and `trajectory_visibility: none\|metadata`; richer team keys fail closed |
| benchflow: | raw | no | typed document schema after v0.3 stabilizes |
| imported steps | yes | no/partial | fail closed per sandbox until implemented |
| root/step artifacts | yes | no/partial | implement collection or fail closed |
| network allowlist | yes | no/partial | per-sandbox capability check |
| separate verifier env | yes | no/partial | materializer plus verifier runner support |
| Windows / TPU | yes | no | fail closed |
| healthcheck | yes | no/partial | fail closed until sandbox healthcheck support lands |
| workdir | yes | partial | absolute non-root paths are materialized; root/relative paths fail closed |
| reward.json precedence | yes | partial | prefer JSON when present and reject both-present mismatches |
| metrics aggregate policy | yes | partial | mean, weighted_mean, and weighted_sum; richer engines remain target work |
| arena-concurrent interaction (G4) | no | no | add interaction-mode schema now; A2A bridge for the agent-under-test + concurrently-running assessor at M2 |
| hybrid reward envelope (G1) | partial | no | declared cross-surface product/sum of factors; M1 |
| GAIN aggregation (G2) | no | no | dynamic live baseline + ceiling; M1 |
| leaderboard-submission (G5) | partial | no | hosted / hidden external scorer with durable result record; M1 |
| RL step-reward (G6) | no | no | per-action environment reaction; M1/M2 |
bench tasks check is structural by default. bench tasks check --level schema
checks only the task authoring entrypoint and prompt parse, so schema-only
fixtures can be validated without pretending to be runnable task packages.
bench tasks check --sandbox <backend> runs the sandbox-aware capability gate
for this matrix, and bench tasks check --level runtime-capability --sandbox <backend>
names that gate explicitly.

bench tasks check --level publication-grade adds the first static native
publication gate: the package must use task.md, native oracle/, native
verifier/verifier.md, rubric files, and selected verifier strategy artifacts
with an explicit reward_json output contract. Unknown sandbox names fail
runtime-capability validation instead of becoming no-op checks.

bench tasks check --level acceptance adds the first static evidence gate:
benchflow.evidence must declare oracle proof, verifier reruns and flake rate, a
verifier stability report with concrete run records, anti-cheat and
instruction-alignment review status, calibration bounds, reference artifacts, a
calibration report with no-op/known-bad/partial/reference cases, and SHA-256
pins for every primary evidence file. The static gate parses the declared JSON
artifacts and checks that oracle rewards, review status, calibration examples,
calibration report cases, and verifier run records agree with the declared
metadata.

bench tasks check --level acceptance-live --sandbox <backend> now adds the
first executable live-evidence slice: benchflow.evidence.acceptance_live.cases
can declare fresh verifier cases and executable oracle/solve.sh reruns that
pass through the selected sandbox and BenchFlow verifier boundary, with reward
thresholds checked from live reward.txt / reward.json output.
acceptance_live.calibration.from: calibration.report can also generate live
no-op/known-bad/partial/reference cases from the static calibration report;
generated low/partial cases require single-line sandbox commands in the report
so the checker runs real perturbations instead of treating missing artifacts as
expected verifier errors.

When acceptance_live.report is declared, the runner writes a package-local
acceptance-live-report JSON artifact plus a .sha256 sidecar with run records,
reward summaries, task/spec hashes, generated/declared case sources, staged
workspace hash, and secret-safe diagnostic fields such as
verifier_error_category, diagnostic_code, and artifact_hint. Dependency install
flakes point to verifier/test-stdout.txt instead of embedding raw resolver
output in the report. For dogfood against checked-in examples that must not
dirty the package, bench tasks check --level acceptance-live --report-output <path>
writes the report and sidecar to a host path instead of the package-local path,
and --no-report-write skips writing the report and sidecar entirely (validation
only); the package-local write is reserved for intentionally refreshing
checked-in evidence.

Repeated live cases can declare expect.flake_rate_max to enforce observed flake
rate across fresh sandbox reruns; without that field, each failed rerun fails
the gate. Larger repeated flake campaigns, model/submission metadata, hosted
leaderboard publication, and leaderboard export are still target work. If
acceptance_live.leaderboard.required: true is declared, the live report
includes a local leaderboard_suitability verdict requiring live runs, all runs
passing, an observed flake rate within
acceptance_live.leaderboard.max_flake_rate, oracle/reference proof, and
generated calibration coverage for no-op, known-bad, partial, and reference
cases.

A parsed field that the selected runtime cannot honor is worse than a parse
error.
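
The flake-rate rule for repeated live cases (expect.flake_rate_max) can be sketched as follows. `flake_gate` is a hypothetical helper, and defining the observed flake rate as failed reruns divided by total reruns is an assumption; the text does not give the exact formula.

```python
from typing import Optional, Sequence


def flake_gate(passed: Sequence[bool],
               flake_rate_max: Optional[float] = None) -> bool:
    """Decide whether a set of fresh reruns passes the gate.

    Without a declared maximum, any failed rerun fails the gate. With one,
    the observed failure fraction must not exceed it.
    """
    if not passed:
        return False  # assumption: no live runs can never satisfy the gate
    failures = sum(1 for ok in passed if not ok)
    if flake_rate_max is None:
        return failures == 0
    return failures / len(passed) <= flake_rate_max
```

With four reruns and one failure, `flake_gate(runs)` fails, `flake_gate(runs, 0.25)` passes, and `flake_gate(runs, 0.2)` fails.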
Architecture Slices
P0: Add TaskPackage / TaskRuntimeView.
The first runtime-facing slices exist in src/benchflow/task/runtime_view.py
and src/benchflow/task/package.py. They answer:
- which entrypoint is authoritative
- what prompt goes into /instruction.md compatibility materialization
- which verifier/oracle directories are native versus legacy
- which scenes were parsed for execution
- which source hashes and compatibility metadata apply
- which verifier document and selected strategy apply
- which sandbox runtime issues block launch
- which compatibility export report describes target-format loss
- which prompt plan composes base, role, scene, and turn prompts for rollout, with redacted document-declared user metadata
task.md plus split-file drift,
verifier/ plus tests/, and oracle/ plus solution/.
The first module is src/benchflow/task/runtime_capabilities.py with a pure
validator:
steps, root/step
artifacts, allowlists, separate verifier environments, Windows, TPU,
healthchecks, unsafe workdirs, document-only user/benchflow runtime
semantics, and non-main verifier services on backends that cannot run them.
It is wired into bench tasks check --sandbox <backend> and the shared sandbox
factory used by rollouts and Environment.from_task(). Unsupported parsed
semantics now raise UnsupportedTaskFeatureError before Docker, Daytona, or
Modal construction. Safe absolute non-root environment.workdir values are
materialized before agent and verifier setup.
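
To make the fail-closed gate concrete, here is a minimal sketch of a pure capability check. It does not reproduce the real runtime_capabilities module: the feature names, the `example-backend` table, and both functions are hypothetical, and the exception class is a local stand-in that only borrows the name used in the text.

```python
class UnsupportedTaskFeatureError(Exception):
    """Local stand-in for the error named above; not the BenchFlow class."""


# Hypothetical capability table: features each backend is able to honor.
BACKEND_FEATURES = {
    "example-backend": {"workdir", "network_allowlist"},
}


def capability_issues(requested: set, backend: str) -> list[str]:
    """Pure check: list every parsed feature the backend cannot honor."""
    if backend not in BACKEND_FEATURES:
        # Unknown backends are an error, never a silent no-op.
        return [f"unknown sandbox backend: {backend}"]
    missing = sorted(requested - BACKEND_FEATURES[backend])
    return [f"{backend} cannot honor: {feature}" for feature in missing]


def ensure_launchable(requested: set, backend: str) -> None:
    """Raise before any sandbox is constructed if the task cannot run as parsed."""
    issues = capability_issues(requested, backend)
    if issues:
        raise UnsupportedTaskFeatureError("; ".join(issues))
```

Keeping `capability_issues` free of side effects lets one function serve both a static check command and the pre-launch gate, with only the thin wrapper deciding to raise.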
P2: Split native and adapter validation modes.
Native authoring should be strict. Foreign import should preserve and warn.
P3: Type the benchflow: namespace.
Start with document_version, compatibility, provenance, assets,
secrets, evidence, teams, nudges, prompt, agent_policy, and
runtime_policy.
P4: Extend exporters.
The first bench tasks export path exports to a compatibility split layout
with explicit loss reports, backed by export_task_to_split_layout(). Extend
the same reporting discipline to external benchmark datasets and same-format
no-op exports.