Task atlas

One task mechanic, deliberately stratified.

The corpus is a locked regulatory-triage slice with controlled coverage across loci, chromosomes, assay families, cell and tissue contexts, and biology systems, so a single-locus result cannot pass as cross-locus evidence.

165 tasks | 2,475 task files | 0 scan findings | open download

Coverage axes

What the benchmark covers.

Each task evaluates whether an agent can pick the noncoding edit that a hidden verifier rewards for a given regulatory readout when no public signal gives the answer away.

What makes the set hard is its coverage across biology, its resistance to leakage, and its resistance to shortcuts. Each axis below is an audited cap.

Atlas axis | Coverage | Why it matters
Task mechanic | 165 regulatory-triage tasks | Keeps the claim narrow, intervention prioritization under hidden regulatory verification.
Loci | 11 non-APOE loci, 15 tasks each | Stops a single-locus story from passing as cross-locus evidence.
Chromosomes | 9 chromosomes | Reduces chromosome-local redundancy and repetitive-context artifacts.
Biology systems | 11 systems | Forces locus and assay reasoning across distinct regulatory contexts.
Assay families | ChIP 60 · RNA-seq 45 · ATAC 30 · CAGE 30 | Prevents treating every objective as one scalar sequence-model problem.
Dominant edit types | DEL 128 · INS 23 · SNV 14 | Action-type imbalance is explicit; no overclaim of edit-family breadth.

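The counts in the coverage table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from the table, while the helper names (`check_strata`, `share`) are hypothetical and not part of the benchmark tooling.

```python
# Counts copied from the coverage table; helper names are illustrative.
TOTAL_TASKS = 165
LOCI, TASKS_PER_LOCUS = 11, 15
ASSAY_FAMILIES = {"ChIP": 60, "RNA-seq": 45, "ATAC": 30, "CAGE": 30}
EDIT_TYPES = {"DEL": 128, "INS": 23, "SNV": 14}


def check_strata(total, strata):
    """Return True if the per-stratum counts add up to the task total."""
    return sum(strata.values()) == total


def share(strata):
    """Fraction of tasks in each stratum, rounded to three decimals."""
    n = sum(strata.values())
    return {name: round(count / n, 3) for name, count in strata.items()}


# Every axis should account for all 165 tasks.
assert LOCI * TASKS_PER_LOCUS == TOTAL_TASKS
assert check_strata(TOTAL_TASKS, ASSAY_FAMILIES)
assert check_strata(TOTAL_TASKS, EDIT_TYPES)

# Makes the edit-type imbalance explicit (deletions are the dominant type).
print(share(EDIT_TYPES))
```

Each axis sums to 165, and `share` shows why the edit-type row is flagged as imbalanced: deletions are the dominant edit type in roughly three of every four tasks.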
Locus map

Each locus contributes 15 tasks, and a different difficulty.

Per-locus outcomes over 90 runs (15 tasks × 6 agents). Some loci are easier than others for agents under this hidden verifier; that is a fact about the agents and the verifier, not a claim that the biology is easier at any locus.

Locus | Gene | Biology system | Outcome · 90 runs (Solve / Near-miss / Failure) | Solve rate

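A per-locus solve rate is the fraction of a locus's runs that end in a solve. The sketch below assumes that each of the 6 agents attempts each of the 15 tasks once (giving the 90 runs per locus) and that every run carries exactly one of the three outcome labels. The function name and label strings are hypothetical.

```python
from collections import Counter

TASKS_PER_LOCUS = 15  # from the text
AGENTS = 6            # from the text
RUNS_PER_LOCUS = TASKS_PER_LOCUS * AGENTS  # 90, assuming one run per agent per task

OUTCOMES = ("solve", "near-miss", "failure")  # label spellings are an assumption


def locus_summary(outcomes):
    """Tally one locus's run outcomes and compute its solve rate."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    if len(outcomes) != RUNS_PER_LOCUS:
        raise ValueError(f"expected {RUNS_PER_LOCUS} runs, got {len(outcomes)}")
    return {
        "counts": {label: counts.get(label, 0) for label in OUTCOMES},
        "solve_rate": counts["solve"] / len(outcomes),
    }
```

Passing the 90 outcome labels recorded for one locus returns the three counts and the solve rate. The length check guards against comparing loci with different numbers of runs.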
Public schema

What the agent sees, and what it never sees.

Each task is a terminal environment the agent works inside. It contains public files, per-candidate feature tables, a public validator that checks only that the answer is well-formed, an offline workspace, and a hidden grader the agent never reads. The split between the public surface and the hidden grader is the benchmark object.
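To make "checks only that the answer is well-formed" concrete, here is a minimal sketch of a form-only validator. The answer format is an assumption made for illustration (a JSON object with a single `candidate_id` key); the benchmark's actual answer schema is not described here, and `validate_answer` is a hypothetical name.

```python
import json


def validate_answer(raw, candidate_ids):
    """Return (ok, message). Checks form only and never scores the choice."""
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(answer, dict):
        return False, "answer must be a JSON object"
    if set(answer) != {"candidate_id"}:
        return False, "answer must contain exactly one key: candidate_id"
    candidate = answer["candidate_id"]
    if not isinstance(candidate, str):
        return False, "candidate_id must be a string"
    if candidate not in candidate_ids:
        return False, "candidate_id is not in the public candidate list"
    return True, "well-formed"
```

The point of the design is that a passing result says nothing about correctness: any listed candidate validates, so the validator cannot leak the hidden grader's preference.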

Public evidence channels

Candidate noncoding edits at a locus, plus the public signal bundle a domain scientist could assemble.

Carbon · Evo2 disagreement · motif / PWM · sequence context · edit geometry · mutagenesis embedding shift · composition · locus / assay metadata

Hidden verification

The private objective is a model-backed regulatory-effect estimate for the requested assay and cell type. It is the hidden verifier that admits and grades tasks, not a wet-lab measurement and not biological ground truth, and it defines the target without exposing scores, ranks, or thresholds on the public surface.

hidden scores · thresholds · ranks · cache IDs · solutions · raw traces

Public sequence priors. The agent's per-candidate signals come from open DNA foundation models. Carbon (Hugging Face, a 1T-token genomic language model) and Evo2 (Arc Institute, a genome model with up to 1 Mbp context over 8.8T tokens) supply variant-effect and sequence signals and their disagreement. Motif/PWM and sequence-context features come from standard annotations. The hidden objective is model-backed and is never exposed.
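How the disagreement signal is computed is not specified. One plausible construction, shown purely as an assumption, standardizes each model's per-candidate scores within a locus and takes the absolute difference. The inputs are hypothetical scalar scores, one per candidate edit; nothing here calls the real Carbon or Evo2 APIs.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize scores within one locus; constant input maps to zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def disagreement(carbon_scores, evo2_scores):
    """Per-candidate |z_carbon - z_evo2|; an illustrative definition only."""
    if len(carbon_scores) != len(evo2_scores):
        raise ValueError("both models must score the same candidates")
    pairs = zip(zscores(carbon_scores), zscores(evo2_scores))
    return [abs(a - b) for a, b in pairs]
```

Standardizing first keeps the two models comparable even if their raw scores live on different scales. Candidates where the two models rank an edit very differently get the largest values.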

Gate status

Release safety is part of the dataset.

Every task manifest row reports cheap gate-pass status, oracle replay, leakage audit, public-support audit, and a wrong-answer negative control before admission.

Gate | Status | Detail

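The admission logic implied by these gates can be sketched as an all-must-pass check over a manifest row. The field names below are hypothetical (the real manifest schema is not shown), and treating admission as a strict conjunction of the five gates is an assumption.

```python
# Hypothetical field names, one per gate named in the text.
GATES = (
    "cheap_gate",
    "oracle_replay",
    "leakage_audit",
    "public_support_audit",
    "wrong_answer_negative_control",
)


def failed_gates(row):
    """List gates that are missing or not explicitly True in a manifest row."""
    return [gate for gate in GATES if row.get(gate) is not True]


def admitted(row):
    """A task is admitted only if every gate passed (assumed policy)."""
    return not failed_gates(row)
```

A missing gate counts as a failure here, so a task cannot be admitted because a check was skipped.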
Release bundle

What ships in the bundle.

The bundle is generated by the release exporter, outside the hidden verifier and grading paths, and then scanned. It is distributed with a SHA-256 checksum.
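A downloaded bundle can be checked against its published checksum with Python's standard `hashlib`. This is a minimal sketch: the bundle path and the expected digest are supplied by the caller, and the function names are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large bundles never load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path, expected_hex):
    """Return True if the file's digest matches the published checksum."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call `verify_bundle` with the local path of the downloaded file and the checksum string published alongside it; a `False` result means the file should not be used.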