# Jarrod Barnes - Full Context

> Researcher and founder working on autonomous scientific discovery.

## Bio

Jarrod Barnes is a researcher and founder who runs Dynamical Systems, a research lab that builds the environments, evaluations, and verification systems that make the scientific loop run end to end. He previously founded Arc, where he built ATLAS, a continual learning framework for LLM agents deployed with Fortune 500 enterprises (93.7% on RewardBench V2). Before that, he built RL environments and distributed training infrastructure at NEAR.

Background: football coach at Ohio State, Clemson, and the Los Angeles Rams; early-stage investor at Emerson Collective; Assistant Professor at NYU; doctoral student at UIUC studying how to keep learners at the edge of learnability.

- Primary stack: Python, Rust, PyTorch, Ray, SGLang
- Open source contributions: SGLang (model serving optimizations), Slime/THUDM (multi-turn RL training cookbooks)

## Selected Work

### 1. Test-Time Verification for Novel Materials (Dynamical Systems, 2026)

**Problem:** In materials discovery, the generation-verification gap means that generating more candidates without earlier verification yields diminishing returns. The final verifier is nature.

**What was built:** Probe-gradient guidance for crystal diffusion models. A 256-parameter linear probe extracts a property signal from the model's hidden states at each denoising step. Backpropagating through the probe steers the generation trajectory toward target structures at test time.
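The guidance mechanism above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the released implementation: `guided_denoise_step`, `toy_denoiser`, and the tensor shapes are stand-ins, and the real system reads hidden states from a crystal diffusion model rather than a toy function.

```python
import torch

HIDDEN_DIM = 256  # stand-in for the denoiser's hidden-state width

# A tiny linear probe (here 256 weights, no bias: 256 parameters), assumed
# to be trained separately to predict the target property from hidden states.
probe = torch.nn.Linear(HIDDEN_DIM, 1, bias=False)

def guided_denoise_step(x_t, denoiser, target, guidance_weight=1.0):
    """One denoising step steered by backprop through the property probe.

    `denoiser` stands in for the diffusion model: it returns the proposed
    next latent plus the hidden state the probe reads from.
    """
    x_t = x_t.detach().requires_grad_(True)
    x_next, hidden = denoiser(x_t)            # unconditional model step
    pred = probe(hidden).squeeze(-1)          # probe's property estimate
    loss = ((pred - target) ** 2).mean()      # distance to target property
    grad, = torch.autograd.grad(loss, x_t)    # d(loss) / d(latent)
    # Nudge the trajectory toward latents the probe scores as on-target.
    return (x_next - guidance_weight * grad).detach()

def toy_denoiser(x):
    # Placeholder dynamics: shrink the latent, expose a nonlinear hidden state.
    return 0.9 * x, torch.tanh(x)

x0 = torch.randn(4, HIDDEN_DIM)
x1 = guided_denoise_step(x0, toy_denoiser, target=torch.tensor(1.0))
```

Swapping the probe (for band gap, formation energy, etc.) changes the steering target without touching the generator, which is the "no retraining" property claimed in the results below.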
**Key results:**

- An unconditional model trained on 97.9% metals reaches 24-43% of structures in a target band-gap window
- Comparable to state-of-the-art conditional generation (MatterGen with self-correcting search: 25-28%)
- Over 50x faster per sample than conditional generation with self-correction
- Compositional uniqueness holds at 99.6-99.9% across all guidance weights
- Pareto sweep: 18,432 structures, 6 guidance weights, 3 seeds, 1,024 samples per batch
- Formation energy probe: AUROC 0.990
- Swap the probe, change the target: no retraining, no mode collapse

**Artifacts:**

- Writeup: https://dynamicalsystems.ai/blog/scaling-test-time-verification
- Code: https://github.com/Dynamical-Systems-Research/test-time-verification

### 2. Training Scientific Judgment (Dynamical Systems, 2026)

**Problem:** Autonomous laboratories compress every layer of discovery except the judgment connecting them. Models lack a training signal for sequential evaluation under budget pressure and noisy evidence.

**What was built:** Verified campaign environments with physics-grounded oracles. Each environment pairs a candidate pool with a staged verifier ladder (cheap, medium, final), a budget constraint, and a deterministic reward for every trust, escalate, and revise decision.

- 247 open-world training environments
- Hybrid reward with 7 oracle-grounded components plus a gated rubric bonus
- Difficulty schedule via a UCB bandit over environment parameters
- SFT on expert demonstrations followed by multi-turn RL
- Domain-general abstraction: any scientific domain with a candidate source and a verification oracle can produce campaign environments
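The environment interface described above can be sketched as follows. This is a minimal illustration under assumptions, not the released code: the class name `CampaignEnv`, the ladder costs and reliabilities, and the specific reward values are all invented for exposition; the real environments use physics-grounded oracles and a 7-component hybrid reward.

```python
import random

# Hypothetical verifier ladder: (name, cost, reliability). The real ladders
# wrap physics-grounded oracles of increasing fidelity and expense.
LADDER = [("cheap", 1, 0.6), ("medium", 5, 0.85), ("final", 25, 1.0)]

class CampaignEnv:
    """Toy campaign environment: noisy staged evidence under a budget."""

    def __init__(self, candidates, budget=100, seed=0):
        self.rng = random.Random(seed)
        self.candidates = candidates      # {candidate: ground-truth label}
        self.budget = budget
        self.beliefs = {}                 # candidate -> last evidence seen

    def run_verifier(self, candidate, level):
        _, cost, reliability = LADDER[level]
        self.budget -= cost
        truth = self.candidates[candidate]
        # Noisy oracle: reports the truth with probability `reliability`.
        return truth if self.rng.random() < reliability else not truth

    def step(self, candidate, action, level=0):
        """Deterministic reward for each trust / escalate / revise decision."""
        if self.budget <= 0:
            return 0.0                    # campaign over: budget exhausted
        if action == "escalate":          # pay for stronger evidence
            self.beliefs[candidate] = self.run_verifier(candidate, level)
            return -0.1
        if action == "trust":             # commit to the current evidence
            evidence = self.beliefs.get(candidate)
            return 1.0 if evidence == self.candidates[candidate] else -1.0
        if action == "revise":            # discard a stale belief
            self.beliefs.pop(candidate, None)
            return -0.05
        raise ValueError(action)

env = CampaignEnv({"mat-001": True}, budget=100, seed=0)
r_escalate = env.step("mat-001", "escalate", level=2)  # final verifier
r_trust = env.step("mat-001", "trust")
```

Because every decision has a deterministic, verifiable reward, trajectories through the ladder are directly usable as an RL training signal, which is what the multi-turn RL stage consumes.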
**Key results:**

- RL-trained model: 60% hypothesis accuracy (vs 47% base, 53% SFT)
- Structure recall on the MADE benchmark exceeds GPT-5.4 by 67%
- Formula recall trails GPT-5.4 by 54% (knowledge scales with pretraining; judgment scales with environment training)
- No regression on held-out closed-world tasks
- Behavioral changes: trusts evidence more selectively, escalates when cheap evidence is ambiguous, revises beliefs faster under conflicting results

**Artifacts:**

- Writeup: https://dynamicalsystems.ai/blog/training-scientific-judgment
- Code: https://github.com/Dynamical-Systems-Research/training-scientific-judgment

### 3. Metacognition Benchmark (Independent Research, 2026)

**Problem:** Models can distinguish valid corrections from invalid pushback in their internal representations. Whether they act on that distinction is a separate question. Existing benchmarks measure accuracy or calibration in isolation.

**What was built:** A signal detection framework for LLM belief revision. 969 items across 8 datasets (science and commonsense). Each item pairs a correct initial answer with a valid correction and an invalid pushback. Measures d-prime (discrimination) and criterion c (decision threshold) as a function of model scale. Full sweep across the Qwen3.5 dense family (0.8B to 9B) with entropy-conditioned analysis and linear probes.

**Key results:**

- Competence scales before control: the 2B model is the most sycophantic (FAR = 0.87), not the smallest
- U-curve in criterion c: metacognitive control turns on between 2B (-1.67) and 4B (-0.88)
- Confident 9B (low varentropy): d' = 2.32, FAR = 0.16
- Uncertain 9B: d' = 1.38, FAR = 0.56
- Science d-prime scales: 0.92 to 2.29
- Commonsense d-prime plateaus: 0.89 to 1.35
- Domain gap grows from 0.03 at 2B to 0.94 at 9B

**Artifacts:**

- Writeup: https://jbarnes850.github.io/2026/03/20/do-models-know-when-to-change-their-mind/
- Code: https://github.com/jbarnes850/metacognition

## Publications

1. Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation (arXiv 2026)
   - Paper: https://arxiv.org/abs/2602.07670
   - Code: https://github.com/jbarnes850/test-time-training
   - Model: https://huggingface.co/Jarrodbarnes/KernelBench-RLVR-120b
2. OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence (arXiv 2026)
   - Paper: https://arxiv.org/abs/2601.21083
   - Code: https://github.com/jbarnes850/opensec-env
   - Key result: 45-97.5% false positive rates across frontier models
3. Continual Learning, Not Training: Online Adaptation for Agents (arXiv 2025)
   - Paper: https://arxiv.org/abs/2511.01093
   - Code: https://github.com/Arc-Computer/ATLAS
   - Key result: 93.7% RewardBench V2

## Contact

- Email: jbarnes850@gmail.com
- GitHub: https://github.com/jbarnes850
- LinkedIn: https://linkedin.com/in/jarrodbarnes
- Lab: https://dynamicalsystems.ai