The corpus is a locked regulatory-triage slice with controlled coverage across loci, chromosomes, assay families, cell and tissue contexts, and biology systems, so a single-locus result cannot pass as cross-locus evidence.
Each task tests whether an agent can pick the noncoding edit that a hidden verifier rewards for a given regulatory readout when no public signal gives the answer away.
What makes the set hard is its coverage across biology, its resistance to leakage, and its resistance to shortcuts. Each axis below is an audited cap.
| Atlas axis | Coverage | Why it matters |
|---|---|---|
| Task mechanic | 165 regulatory-triage tasks | Keeps the claim narrow: intervention prioritization under hidden regulatory verification. |
| Loci | 11 non-APOE loci, 15 tasks each | Stops a single-locus story from passing as cross-locus evidence. |
| Chromosomes | 9 chromosomes | Reduces chromosome-local redundancy and repetitive-context artifacts. |
| Biology systems | 11 systems | Forces locus and assay reasoning across distinct regulatory contexts. |
| Assay families | ChIP 60 · RNA-seq 45 · ATAC 30 · CAGE 30 | Prevents treating every objective as one scalar sequence-model problem. |
| Dominant edit types | DEL 128 · INS 23 · SNV 14 | Action-type imbalance is explicit; no overclaim of edit-family breadth. |
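
The counts in the table can be cross-checked against each other. The sketch below copies the numbers from the table and confirms that each breakdown sums to the 165-task total; the variable and function names are illustrative and are not part of the release.

```python
# Counts copied from the coverage table; names are illustrative only.
TOTAL_TASKS = 165
LOCI, TASKS_PER_LOCUS = 11, 15
ASSAY_FAMILIES = {"ChIP": 60, "RNA-seq": 45, "ATAC": 30, "CAGE": 30}
EDIT_TYPES = {"DEL": 128, "INS": 23, "SNV": 14}


def coverage_is_consistent() -> bool:
    """Return True if every breakdown accounts for all tasks exactly once."""
    return (
        LOCI * TASKS_PER_LOCUS == TOTAL_TASKS
        and sum(ASSAY_FAMILIES.values()) == TOTAL_TASKS
        and sum(EDIT_TYPES.values()) == TOTAL_TASKS
    )


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts into fractions of the breakdown's own total."""
    total = sum(counts.values())
    return {name: round(n / total, 3) for name, n in counts.items()}


if __name__ == "__main__":
    print("consistent:", coverage_is_consistent())
    print("edit-type shares:", shares(EDIT_TYPES))
```

The edit-type shares make the imbalance noted in the last row concrete: deletions account for roughly three quarters of the dominant edits.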
Per-locus outcome over 90 runs (6 agents). Some loci are easier for agents under this hidden verifier than others, which is a fact about the agents and the verifier, not a claim that the biology is easier at any locus.
| Locus | Gene | Biology system | Outcome · 90 runs | Solve rate |
|---|---|---|---|---|
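
A minimal sketch of how a solve-rate cell could be derived from a per-locus outcome. It assumes the 90 runs per locus are 15 tasks times 6 agents with one run per agent per task, which matches the stated counts but is not spelled out in the source. The function names and the example count of 45 are hypothetical and are not reported results.

```python
TASKS_PER_LOCUS = 15
AGENTS = 6
# Assumption: one run per agent per task, giving 90 runs per locus.
RUNS_PER_LOCUS = TASKS_PER_LOCUS * AGENTS


def solve_rate(solved_runs: int, total_runs: int = RUNS_PER_LOCUS) -> float:
    """Fraction of runs at a locus that the hidden verifier counted as solved."""
    if not 0 <= solved_runs <= total_runs:
        raise ValueError("solved_runs must be between 0 and total_runs")
    return solved_runs / total_runs


def format_outcome(solved_runs: int, total_runs: int = RUNS_PER_LOCUS) -> str:
    """Render an outcome as 'solved/total (rate%)'."""
    rate = solve_rate(solved_runs, total_runs)
    return f"{solved_runs}/{total_runs} ({rate:.1%})"


if __name__ == "__main__":
    # 45 is a made-up count used only to show the formatting.
    print(format_outcome(45))
```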
Each task is a terminal environment the agent works inside, with public files, per-candidate feature tables, a public validator that only checks the answer is well-formed, an offline workspace, and a hidden grader the agent never reads. The split between the public surface and the hidden grader is the benchmark object.
Candidate noncoding edits at a locus, plus the public signal bundle a domain scientist could assemble.
The private objective is a model-backed regulatory-effect estimate for the requested assay and cell type. It is the hidden verifier that admits and grades tasks, not a wet-lab measurement and not biological ground truth, and it defines the target without exposing scores, ranks, or thresholds on the public surface.
Public sequence priors. The agent's per-candidate signals come from open DNA foundation models. Carbon (Hugging Face, a 1T-token genomic language model) and Evo2 (Arc Institute, a genome model with up to 1 Mbp context over 8.8T tokens) supply variant-effect and sequence signals and their disagreement. Motif/PWM and sequence-context features come from standard annotations. The hidden objective is model-backed and is never exposed.
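
The source does not define how model disagreement is computed. As one plausible illustration, the sketch below standardizes each model's per-candidate scores and takes the absolute difference, so the two models are compared on a common scale. The function names and input scores are hypothetical, and no Carbon or Evo2 API is called.

```python
from statistics import mean, pstdev


def zscores(values: list[float]) -> list[float]:
    """Standardize scores across candidates; constant inputs map to zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def disagreement(scores_a: list[float], scores_b: list[float]) -> list[float]:
    """Per-candidate absolute gap between two models' standardized scores.

    This is an assumed definition for illustration, not the benchmark's.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same candidates")
    za, zb = zscores(scores_a), zscores(scores_b)
    return [abs(a - b) for a, b in zip(za, zb)]


if __name__ == "__main__":
    # Made-up scores for three candidate edits at one locus.
    model_a = [0.10, 0.40, 0.90]
    model_b = [0.80, 0.35, 0.05]
    print(disagreement(model_a, model_b))
```

Standardizing first keeps a model with a wider raw score range from dominating the gap; a rank-based comparison would be an equally reasonable choice.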
Every task manifest row reports cheap gate-pass status, oracle replay, leakage audit, public-support audit, and a wrong-answer negative control before admission.
| Gate | Status | Detail |
|---|---|---|
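
A sketch of how the per-row gate statuses could be combined into an admission decision. It assumes a task is admitted only when all five gates pass, and it uses hypothetical field names for a manifest row; the real manifest schema is not shown in the source.

```python
# Hypothetical field names, one per gate named in the text above.
GATES = (
    "cheap_gate",
    "oracle_replay",
    "leakage_audit",
    "public_support_audit",
    "wrong_answer_negative_control",
)


def failed_gates(row: dict) -> list[str]:
    """List every gate that is missing or not explicitly True in the row."""
    return [gate for gate in GATES if row.get(gate) is not True]


def is_admissible(row: dict) -> bool:
    """Assumed rule: admit a task only when no gate has failed."""
    return not failed_gates(row)


if __name__ == "__main__":
    # Made-up manifest row in which the leakage audit has not passed.
    row = {gate: True for gate in GATES}
    row["leakage_audit"] = False
    print(is_admissible(row), failed_gates(row))
```

Treating a missing field as a failure keeps an incomplete manifest row from being admitted by default.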
Generated by the release exporter outside hidden verifier and grading paths, then scanned. Distributed with a SHA-256 checksum.
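
A downloaded release file can be checked against its published SHA-256 value with Python's standard `hashlib`. This is a minimal sketch: the function names are illustrative, and the file path and expected digest must come from the actual release.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A mismatch means the file differs from the one that was scanned and published, so it should not be used for evaluation.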
2,475 task files across benchmark_audit, benchmark_corpus_v1, launch_readiness, manifest, and tasks. Zero scan findings.
The release-safe stratification, outcomes, behavior codebook, and near-miss review that every page on this site renders from.