Research
I work on systems that turn open-ended work into trainable environments, verifiers, rewards, and model-improvement loops.
Current threads
Training scientific judgment
Verified campaign environments convert search, trust, escalation, and revision into a multi-turn RL problem with a physics-grounded oracle reward.
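A minimal sketch of that shape of environment, assuming a generic step/reward interface: every name here (`CampaignEnv`, `Action`, `physics_oracle`) is a hypothetical illustration, not the actual system's API. The key property is that intermediate turns earn nothing and all signal flows from the oracle's score on the final answer.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    SEARCH = auto()    # query for new evidence
    TRUST = auto()     # commit to a source's claim
    ESCALATE = auto()  # request a higher-fidelity check
    REVISE = auto()    # update the working answer
    SUBMIT = auto()    # end the episode

def physics_oracle(answer: str) -> float:
    """Stand-in for the physics-grounded verifier; returns reward in [0, 1]."""
    return 1.0 if answer == "ground_truth" else 0.0

@dataclass
class CampaignEnv:
    max_turns: int = 8
    turn: int = 0
    answer: str = ""

    def step(self, action: Action, payload: str = ""):
        """Advance one turn. Intermediate actions earn no reward; only the
        oracle score on the final answer produces learning signal."""
        self.turn += 1
        if action is Action.REVISE:
            self.answer = payload
        done = action is Action.SUBMIT or self.turn >= self.max_turns
        reward = physics_oracle(self.answer) if done else 0.0
        return {"turn": self.turn}, reward, done
```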
Scaling test-time verification for novel materials
Probe-gradient guidance extracts a band-gap signal from an unconditional crystal diffusion model and steers sampling without retraining.
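In the spirit of classifier guidance, one reverse step might look like the sketch below. The `denoiser` and `probe` callables and the guidance `scale` are assumptions for illustration, not the project's exact method; the point is that the unconditional model stays frozen and steering comes only from the gradient of the probe's error with respect to the current sample.

```python
import torch

def guided_step(x_t, t, denoiser, probe, target_gap, scale=1.0):
    """One reverse-diffusion step nudged toward a target band gap."""
    x_t = x_t.detach().requires_grad_(True)
    pred_gap = probe(x_t, t)                     # probe on the noisy latent
    loss = ((pred_gap - target_gap) ** 2).sum()  # distance to desired gap
    grad = torch.autograd.grad(loss, x_t)[0]
    with torch.no_grad():
        x_prev = denoiser(x_t, t)                # unconditional estimate
        return x_prev - scale * grad             # steer without retraining
```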
Self-improving pretraining
Ordinary pretraining chunks become continuation-choice data, interleaved-thought SFT examples, and RL mid-training environments.
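As a concrete illustration of the first transformation, here is a sketch of turning a raw chunk into a continuation-choice example. The function, its whitespace tokenization, and the slicing scheme are illustrative assumptions, not the pipeline's implementation; it presumes chunks hold at least `2 * prefix_len` tokens and that more than one chunk exists.

```python
import random

def make_continuation_choice(chunks, i, prefix_len=64, n_distractors=3, rng=random):
    """Split chunk i into a prefix and its true continuation, then sample
    same-length continuations from other chunks as distractors."""
    words = chunks[i].split()
    prefix = " ".join(words[:prefix_len])
    gold = " ".join(words[prefix_len:2 * prefix_len])
    distractors = []
    while len(distractors) < n_distractors:
        j = rng.randrange(len(chunks))
        if j != i:
            other = chunks[j].split()
            distractors.append(" ".join(other[prefix_len:2 * prefix_len]))
    options = distractors + [gold]
    rng.shuffle(options)
    return {"prefix": prefix, "options": options, "answer": options.index(gold)}
```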
Papers
RLVR / Code
An RL-trained 120B KernelBench policy combined with verifier-filtered selection under fixed inference budgets.
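Verifier-filtered selection under a fixed budget reduces to a best-of-n loop like the sketch below; `policy.sample`, `verifier.score`, and the pass threshold are hypothetical stand-ins for whatever interfaces the paper actually uses.

```python
def select_kernel(policy, verifier, prompt, budget=16, threshold=0.5):
    """Sample `budget` candidate kernels, discard any the verifier rejects,
    and return the highest-scoring survivor (or None if all fail)."""
    candidates = [policy.sample(prompt) for _ in range(budget)]
    scored = [(verifier.score(prompt, c), c) for c in candidates]
    passing = [(s, c) for s, c in scored if s >= threshold]
    return max(passing, key=lambda sc: sc[0])[1] if passing else None
```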
Security agents / Calibration
A dual-control incident-response environment measures whether agents verify evidence before containment under adversarial context.
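The measurement itself could be as simple as an ordering check over the logged trajectory. This is a sketch under the assumption that actions are recorded as (type, payload) tuples; the action names are hypothetical, not the environment's actual schema.

```python
def verified_before_containment(trajectory):
    """Return True iff an evidence-verification step precedes the first
    containment action. Agents that contain on unverified (possibly
    planted) context fail; never containing also counts as a miss here,
    purely for illustration."""
    for action, _payload in trajectory:
        if action == "verify_evidence":
            return True
        if action == "contain":
            return False
    return False
```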
Agents / Continual learning
ATLAS frames agent improvement as online adaptation from trajectories rather than static fine-tuning alone.
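The framing reduces to a loop like the sketch below, where the agent updates between episodes from a rolling buffer of its own trajectories; `agent.rollout` and `agent.update` are hypothetical stand-ins, not ATLAS's actual learning rule.

```python
from collections import deque

def online_adaptation(agent, env, episodes=100, buffer_size=32):
    """Adapt between episodes from recent trajectories, rather than a
    single offline fine-tune on a frozen dataset."""
    buffer = deque(maxlen=buffer_size)
    for _ in range(episodes):
        trajectory, reward = agent.rollout(env)  # act with current weights
        buffer.append((trajectory, reward))
        agent.update(buffer)                     # online learning step
    return agent
```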