Across six frontier models the failure repeats. The agent reads the public evidence, states the rule for rejecting an attractive but wrong signal, then commits to that signal when it has to choose. The strict solve rate is 11.3%, and no model clears 15%.
Each family is balanced across 11 loci, 15 tasks each, so a family result is not a single-locus artifact. Counts pool runs from all six providers.
| Workflow family | Solve · near · fail | Solve rate | Failure shape |
|---|---|---|---|
Promoter-output triage, the CAGE family, is the hardest at 5.6%, yet it produces 77 near-misses across 180 runs. Agents correctly favor the edit near the promoter and still miss the transcription start site that CAGE scores. The family that agents solve least often is the one where the most salient public signal sits farthest from the requested readout.
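The CAGE figures above are enough to back out the family's full solve, near, fail split. The sketch below is a reader-side reconstruction, not the benchmark's own tooling. It assumes the 5.6% solve rate is computed over the same 180 runs, and all variable names are illustrative.

```python
# Reconstruct the CAGE family's solve / near / fail split from the quoted figures.
RUNS = 180            # runs in the family (from the text)
NEAR = 77             # near-misses (from the text)
REPORTED_RATE = 5.6   # strict solve rate in percent (from the text)

# ASSUMPTION: the reported rate is solves / RUNS rounded to one decimal place.
# Collect every integer solve count that rounds to the reported percentage.
candidates = [k for k in range(RUNS + 1)
              if round(100 * k / RUNS, 1) == REPORTED_RATE]

solves = candidates[0]
fails = RUNS - solves - NEAR   # whatever is neither a solve nor a near-miss

print(f"candidate solve counts: {candidates}")
print(f"solve / near / fail: {solves} / {NEAR} / {fails}")
```

Under that assumption only one solve count fits, so near-misses outnumber strict solves in this family by a wide margin.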
This mirrors the scBench result in single-cell analysis, where agents run every plausible step and miss the one that decides the outcome. The question the trace answers is not only whether the final edit is correct, but whether the agent weighed the evidence specific to this readout against the proxy that looks strong everywhere.
All runs pass the public and grading interface, so non-solves are scientific search failures rather than packaging breakdowns. Of the 252 valid near-misses, 240 carry a quality label.
Of the 240 labeled near-misses, 32 are scientifically mature, weighing alternatives and treating the hidden objective as uncertain. Another 153 do real biological search but get pulled to the obvious public signal at the final choice. The remaining 55 lean on a public ranking from the start with little search. Public scalar collapse drives 189 of the 252 near-misses, and the same pull accounts for 640 of the 878 non-solves overall.
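The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. The label names are shorthand invented here, and the run total rests on one assumption flagged in the comments: that 11 loci with 15 tasks each describes the whole task set, each task run once by all six models.

```python
# Shares of the labeled near-misses (counts from the text; key names are shorthand).
labeled = {
    "scientifically_mature": 32,
    "search_then_collapse": 153,
    "ranking_from_start": 55,
}
total_labeled = sum(labeled.values())
label_share = {name: count / total_labeled for name, count in labeled.items()}

# Share of outcomes attributed to public scalar collapse.
collapse_share_near = 189 / 252   # among near-misses
collapse_share_all = 640 / 878    # among all non-solves

# ASSUMPTION: 11 loci x 15 tasks is the full task set, run once per model.
tasks = 11 * 15
runs = tasks * 6
solves = runs - 878               # 878 non-solves is from the text
strict_rate = solves / runs

for name, share in label_share.items():
    print(f"{name}: {share:.1%}")
print(f"collapse among near-misses: {collapse_share_near:.1%}")
print(f"collapse among non-solves:  {collapse_share_all:.1%}")
print(f"implied strict solve rate:  {strict_rate:.1%} ({solves}/{runs})")
```

The three labels sum to the 240 labeled near-misses, collapse accounts for exactly three quarters of near-misses and roughly 73% of all non-solves, and under the stated assumption the implied solve rate matches the reported 11.3%.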
A paired diagnostic tests whether better process can fix proxy-collapse. It appends a hidden-blind workflow checklist to the same tasks and replays them. The checklist describes how to orient to the assay, build an evidence table, name public decoys, and contrast alternatives, and carries no hidden answer, rank, score, or candidate hint.
The checklist lifts process across the board while strict solves barely move, from 4 to 6 of 132. If the bottleneck were workflow organization, fixing the workflow would recover solves, and it does not. The residual difficulty is the scientific decision itself, where the public evidence still points agents toward edits that fail the hidden, model-backed objective. The benchmark is testing judgment about the biology, not formatting.
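To put "barely move" in numbers, the sketch below converts the two solve counts to rates and attaches a 95% Wilson score interval to each. The interval is an illustrative choice made here, not an analysis from the source, and it treats runs as independent, so it ignores the paired design. The helper name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

N = 132                      # tasks in the paired diagnostic (from the text)
before, after = 4, 6         # strict solves without / with the checklist (from the text)

rate_before = before / N
rate_after = after / N
lo_before, hi_before = wilson_interval(before, N)
lo_after, hi_after = wilson_interval(after, N)

print(f"baseline:  {rate_before:.1%}  [{lo_before:.1%}, {hi_before:.1%}]")
print(f"checklist: {rate_after:.1%}  [{lo_after:.1%}, {hi_after:.1%}]")
print(f"difference: {100 * (rate_after - rate_before):.1f} percentage points")
```

The two intervals overlap heavily, which is consistent with reading the two extra solves as no meaningful recovery. A paired test such as McNemar's would be the more appropriate check, but it needs per-task outcomes that are not given here.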