Research
I work on systems that turn open-ended work into trainable environments, verifiers, rewards, and model-improvement loops.
Current threads
Scientific discovery
Verified campaign environments convert search, trust, escalation, and revision into a multi-turn RL problem with physics-grounded oracle reward. Trajectory-level RL on 247 environments lifts held-out hypothesis accuracy from 55.2% to 79.3% for a model with 3B active parameters, ahead of GPT-5.4 at 72.4%.
Verification
Probe-gradient guidance extracts band-gap signal from an unconditional crystal diffusion model and steers sampling without retraining. Guided by a probe with 0.957 AUROC, a model trained on 97.9% metals places 24% to 43% of its samples in the target band-gap window, comparable to conditional generation at over 50x lower sampling cost.
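As a rough illustration of the guidance idea only, and not the crystal model or its actual sampler, the sketch below adds the gradient of a probe's log-probability to an unconditional score during Langevin sampling. Everything in it is an assumption made for the toy: the standard-Gaussian score standing in for the diffusion model, the logistic probe with weights `w` and bias `b`, and the names `guided_langevin_step`, `sample`, and `fraction_in_target`.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def probe_log_prob_grad(x, w, b):
    """Gradient of log sigmoid(w.x + b) with respect to x."""
    p = sigmoid(probe_logit(x, w, b))
    return [(1.0 - p) * wi for wi in w]


def unconditional_score(x):
    """Toy stand-in for the model's score: a standard Gaussian, grad log p(x) = -x."""
    return [-xi for xi in x]


def guided_langevin_step(x, w, b, scale, step, rng):
    """One Langevin step on score(x) + scale * grad log p_probe(target | x)."""
    score = unconditional_score(x)
    guide = probe_log_prob_grad(x, w, b)
    noise_std = math.sqrt(2.0 * step)
    return [
        xi + step * (si + scale * gi) + noise_std * rng.gauss(0.0, 1.0)
        for xi, si, gi in zip(x, score, guide)
    ]


def sample(w, b, scale, n_steps=100, step=0.05, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in w]
    for _ in range(n_steps):
        x = guided_langevin_step(x, w, b, scale, step, rng)
    return x


def fraction_in_target(w, b, scale, n_samples=100):
    """Share of samples the probe places in the target class (logit > 0)."""
    hits = sum(
        probe_logit(sample(w, b, scale, seed=s), w, b) > 0 for s in range(n_samples)
    )
    return hits / n_samples
```

Setting `scale=0.0` recovers unguided sampling, so comparing `fraction_in_target` at zero and nonzero scale shows how much the probe gradient shifts samples toward the target region without touching the underlying score model.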
Post-training
Ordinary pretraining chunks become continuation-choice data, interleaved-thought SFT examples, and RL mid-training environments.
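One plausible reading of "continuation-choice data" is sketched below: split a chunk into a prefix and its true continuation, then mix in continuations taken from other chunks as distractors. The record layout (`prefix`, `options`, `answer`), the fixed split point, and the function name `make_continuation_choice` are assumptions for illustration, not the format actually used.

```python
import random


def make_continuation_choice(chunks, index, split, n_distractors, rng):
    """Build one multiple-choice example from tokenized pretraining chunks.

    chunks: list of token lists; index: which chunk supplies the prefix;
    split: number of prefix tokens; rng: a random.Random instance.
    Assumes the chunks' continuations are distinct from one another.
    """
    prefix = chunks[index][:split]
    true_continuation = chunks[index][split:]
    # Distractors are continuations of *other* chunks at the same split point.
    others = [c[split:] for i, c in enumerate(chunks) if i != index]
    options = rng.sample(others, n_distractors) + [true_continuation]
    rng.shuffle(options)
    return {
        "prefix": prefix,
        "options": options,
        "answer": options.index(true_continuation),
    }
```

The same prefix and continuation pair could in principle feed the other two uses named above, with the choice task serving as a verifiable reward, though that wiring is not shown here.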
Interpretability
Signal detection measures whether models revise for valid critique while resisting invalid critique across architecture, domain, and source pressure.
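In standard signal-detection terms, revising after a valid critique can be counted as a hit and revising after an invalid critique as a false alarm. The sketch below computes sensitivity (d') and criterion from those counts. Mapping the evaluation onto exactly these two statistics, the log-linear correction, and the function name `revision_sensitivity` are assumptions, not a description of the actual analysis.

```python
from statistics import NormalDist


def revision_sensitivity(valid_revised, valid_total, invalid_revised, invalid_total):
    """Return (d_prime, criterion) from revision counts.

    Hit rate: share of valid critiques that led to a revision.
    False-alarm rate: share of invalid critiques that led to a revision.
    A log-linear correction (+0.5 to counts, +1 to totals) keeps rates off 0 and 1.
    """
    hit_rate = (valid_revised + 0.5) / (valid_total + 1.0)
    false_alarm_rate = (invalid_revised + 0.5) / (invalid_total + 1.0)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa            # discrimination between valid and invalid critique
    criterion = -0.5 * (z_hit + z_fa)  # positive means a general reluctance to revise
    return d_prime, criterion
```

A model that revises for everything and a model that revises for nothing both score a d' near zero; they differ only in criterion, which is what separates discrimination from blanket compliance or blanket stubbornness.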
Papers
Scientific agents / Benchmark construction
A benchmark-construction method that turns disagreement between public scientific models into verifier-backed agent tasks. The first release is 165 regulatory-genomics triage tasks where 6 frontier agents reach an 11.3% aggregate strict solve rate and none exceed 15%.
Post-training / Trajectory data
An incentive-aligned agent arena becomes a post-training trajectory substrate. A structural-quality filter turns raw ShoppingBench subnet traces into a trainable corpus that lifts Qwen3-4B from 18.0% to 42.7% strict solve rate, training on a fraction of a single day of arena output.
RLVR / Code
An RL-trained 120B policy for KernelBench, paired with verifier-filtered selection under fixed inference budgets.
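Verifier-filtered selection under a fixed budget can be sketched as a best-of-n loop: draw at most `budget` candidates, discard any the verifier rejects, and keep the highest-scoring survivor. The callables `generate`, `verify`, and `score` and the function name `select_under_budget` are hypothetical placeholders; the real policy, correctness checks, and timing harness are not shown.

```python
def select_under_budget(generate, verify, score, budget):
    """Return the best verified candidate from at most `budget` samples, or None.

    generate: () -> candidate       (e.g. one sampled kernel)
    verify:   candidate -> bool     (e.g. passes correctness checks)
    score:    candidate -> float    (e.g. measured speedup; higher is better)
    """
    best, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = generate()
        if not verify(candidate):
            continue  # unverified candidates never compete on score
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

Filtering before scoring matters here: a fast but incorrect candidate is dropped outright instead of winning on speed.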
Security agents / Calibration
A dual-control incident-response benchmark that measures whether agents verify evidence before containment under adversarial context, with deterministic oracle scoring and published traces.
Agents / Continual learning
ATLAS modeled the advisor/executor pattern as a continual-learning system. A stronger teacher guided a lower-cost production agent, while reward-scored trajectories were stored as persistent learning memory and reused for inference-time adaptation, RL, and on-policy distillation.