# Jarrod Barnes - Full Context

> Researcher and founder working on autonomous scientific discovery.

## Bio

Jarrod Barnes is a researcher and founder who runs Dynamical Systems, a research lab that builds the environments, evaluations, and verification systems that make the scientific loop run end to end. He previously founded Arc, where he built ATLAS, a continual learning framework for LLM agents deployed with Fortune 500 enterprises (93.7% on RewardBench V2). Before that, he built RL environments and distributed training infrastructure at NEAR.

Background: football coach at Ohio State, Clemson, and the Los Angeles Rams; early-stage investor at Emerson Collective; Assistant Professor at NYU; doctoral student at UIUC studying how to keep learners at the edge of learnability.

- Primary stack: Python, Rust, PyTorch, Ray, SGLang
- Open source contributions: SGLang (model serving optimizations), Slime/THUDM (multi-turn RL training cookbooks)

## Selected Work

### 1. Test-Time Verification for Novel Materials (Dynamical Systems, 2026)

**Problem:** In materials discovery, the generation-verification gap means that generating more candidates without earlier verification yields diminishing returns. The final verifier is nature.

**What was built:** Probe-gradient guidance for crystal diffusion models. A 256-parameter linear probe extracts a property signal from the model's hidden states at each denoising step. Backpropagating through the probe steers the generation trajectory toward target structures at test time.
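The guidance mechanism above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the released implementation: `guided_denoise_step`, `toy_denoiser`, and the tensor shapes are stand-ins, and the real system reads hidden states from a crystal diffusion model rather than a toy function.

```python
import torch

HIDDEN_DIM = 256  # stand-in for the denoiser's hidden-state width

# A tiny linear probe (here 256 weights, no bias: 256 parameters), assumed
# to be trained separately to predict the target property from hidden states.
probe = torch.nn.Linear(HIDDEN_DIM, 1, bias=False)

def guided_denoise_step(x_t, denoiser, target, guidance_weight=1.0):
    """One denoising step steered by backprop through the property probe.

    `denoiser` stands in for the diffusion model: it returns the proposed
    next latent plus the hidden state the probe reads from.
    """
    x_t = x_t.detach().requires_grad_(True)
    x_next, hidden = denoiser(x_t)            # unconditional model step
    pred = probe(hidden).squeeze(-1)          # probe's property estimate
    loss = ((pred - target) ** 2).mean()      # distance to target property
    grad, = torch.autograd.grad(loss, x_t)    # d(loss) / d(latent)
    # Nudge the trajectory toward latents the probe scores as on-target.
    return (x_next - guidance_weight * grad).detach()

def toy_denoiser(x):
    # Placeholder dynamics: shrink the latent, expose a nonlinear hidden state.
    return 0.9 * x, torch.tanh(x)

x0 = torch.randn(4, HIDDEN_DIM)
x1 = guided_denoise_step(x0, toy_denoiser, target=torch.tensor(1.0))
```

Swapping the probe (for band gap, formation energy, etc.) changes the steering target without touching the generator, which is the "no retraining" property claimed in the results below.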
**Key results:**

- An unconditional model trained on 97.9% metals reaches 24-43% of structures in a target band-gap window
- Comparable to state-of-the-art conditional generation (MatterGen with self-correcting search: 25-28%)
- Over 50x faster per sample than conditional generation with self-correction
- Compositional uniqueness holds at 99.6-99.9% across all guidance weights
- Pareto sweep: 18,432 structures, 6 guidance weights, 3 seeds, 1,024 samples per batch
- Formation energy probe: AUROC 0.990
- Swap the probe, change the target: no retraining, no mode collapse

**Artifacts:**

- Writeup: https://dynamicalsystems.ai/blog/scaling-test-time-verification
- Code: https://github.com/Dynamical-Systems-Research/test-time-verification

### 2. Training Scientific Judgment (Dynamical Systems, 2026)

**Problem:** Autonomous laboratories compress every layer of discovery except the judgment connecting them. Models lack a training signal for sequential evaluation under budget pressure and noisy evidence.

**What was built:** Verified campaign environments with physics-grounded oracles. Each environment pairs a candidate pool with a staged verifier ladder (cheap, medium, final), a budget constraint, and a deterministic reward for every trust, escalate, and revise decision.

- 247 open-world training environments
- Hybrid reward with 7 oracle-grounded components plus a gated rubric bonus
- Difficulty schedule via a UCB bandit over environment parameters
- SFT on expert demonstrations followed by multi-turn RL
- Domain-general abstraction: any scientific domain with a candidate source and a verification oracle can produce campaign environments
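The environment interface described above can be sketched as follows. This is a minimal illustration under assumptions, not the released code: the class name `CampaignEnv`, the ladder costs and reliabilities, and the specific reward values are all invented for exposition; the real environments use physics-grounded oracles and a 7-component hybrid reward.

```python
import random

# Hypothetical verifier ladder: (name, cost, reliability). The real ladders
# wrap physics-grounded oracles of increasing fidelity and expense.
LADDER = [("cheap", 1, 0.6), ("medium", 5, 0.85), ("final", 25, 1.0)]

class CampaignEnv:
    """Toy campaign environment: noisy staged evidence under a budget."""

    def __init__(self, candidates, budget=100, seed=0):
        self.rng = random.Random(seed)
        self.candidates = candidates      # {candidate: ground-truth label}
        self.budget = budget
        self.beliefs = {}                 # candidate -> last evidence seen

    def run_verifier(self, candidate, level):
        _, cost, reliability = LADDER[level]
        self.budget -= cost
        truth = self.candidates[candidate]
        # Noisy oracle: reports the truth with probability `reliability`.
        return truth if self.rng.random() < reliability else not truth

    def step(self, candidate, action, level=0):
        """Deterministic reward for each trust / escalate / revise decision."""
        if self.budget <= 0:
            return 0.0                    # campaign over: budget exhausted
        if action == "escalate":          # pay for stronger evidence
            self.beliefs[candidate] = self.run_verifier(candidate, level)
            return -0.1
        if action == "trust":             # commit to the current evidence
            evidence = self.beliefs.get(candidate)
            return 1.0 if evidence == self.candidates[candidate] else -1.0
        if action == "revise":            # discard a stale belief
            self.beliefs.pop(candidate, None)
            return -0.05
        raise ValueError(action)

env = CampaignEnv({"mat-001": True}, budget=100, seed=0)
r_escalate = env.step("mat-001", "escalate", level=2)  # final verifier
r_trust = env.step("mat-001", "trust")
```

Because every decision has a deterministic, verifiable reward, trajectories through the ladder are directly usable as an RL training signal, which is what the multi-turn RL stage consumes.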
**Key results:**

- RL-trained model: 60% hypothesis accuracy (vs 47% base, 53% SFT)
- Structure recall on the MADE benchmark exceeds GPT-5.4 by 67%
- Formula recall trails GPT-5.4 by 54% (knowledge scales with pretraining; judgment scales with environment training)
- No regression on held-out closed-world tasks
- Behavioral changes: trusts evidence more selectively, escalates when cheap evidence is ambiguous, revises beliefs faster under conflicting results

**Artifacts:**

- Writeup: https://dynamicalsystems.ai/blog/training-scientific-judgment
- Code: https://github.com/Dynamical-Systems-Research/training-scientific-judgment

### 3. Metacognition Benchmark (Independent Research, 2026)

**Problem:** Models can distinguish valid corrections from invalid pushback in their internal representations. Whether they act on that distinction is a separate question. Existing benchmarks measure accuracy or calibration in isolation.

**What was built:** A signal detection framework for LLM belief revision. 969 items across 8 datasets (science and commonsense). Each item pairs a correct initial answer with a valid correction and an invalid pushback. Measures d-prime (discrimination) and criterion c (decision threshold) as a function of model scale. Full sweep across the Qwen3.5 dense family (0.8B to 9B) with entropy-conditioned analysis and linear probes.

**Key results:**

- Competence scales before control: the 2B model is the most sycophantic (FAR = 0.87), not the smallest
- U-curve in criterion c: metacognitive control turns on between 2B (-1.67) and 4B (-0.88)
- Confident 9B (low varentropy): d' = 2.32, FAR = 0.16
- Uncertain 9B: d' = 1.38, FAR = 0.56
- Science d-prime scales: 0.92 to 2.29
- Commonsense d-prime plateaus: 0.89 to 1.35
- Domain gap grows from 0.03 at 2B to 0.94 at 9B

**Artifacts:**

- Writeup: https://jbarnes850.github.io/2026/03/20/do-models-know-when-to-change-their-mind/
- Code: https://github.com/jbarnes850/metacognition

## Publications

1. Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation (arXiv 2026)
   - Paper: https://arxiv.org/abs/2602.07670
   - Code: https://github.com/jbarnes850/test-time-training
   - Model: https://huggingface.co/Jarrodbarnes/KernelBench-RLVR-120b
2. OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence (arXiv 2026)
   - Paper: https://arxiv.org/abs/2601.21083
   - Code: https://github.com/jbarnes850/opensec-env
   - Key result: 45-97.5% false positive rates across frontier models
3. Continual Learning, Not Training: Online Adaptation for Agents (arXiv 2025)
   - Paper: https://arxiv.org/abs/2511.01093
   - Code: https://github.com/Arc-Computer/ATLAS
   - Key result: 93.7% RewardBench V2

## Contact

- Email: jbarnes850@gmail.com
- GitHub: https://github.com/jbarnes850
- LinkedIn: https://linkedin.com/in/jarrodbarnes
- Lab: https://dynamicalsystems.ai