Research

I work on systems that turn open-ended work into trainable environments, verifiers, rewards, and model-improvement loops.

Current threads

Training scientific judgment

Verified campaign environments convert search, trust, escalation, and revision into a multi-turn RL problem with a physics-grounded oracle reward.
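
A minimal sketch of the shape such an environment could take: a step interface that exposes search/escalate/revise/submit actions and pays out a single oracle-checked reward at episode end. All names here (CampaignEnv, PhysicsOracle) are illustrative stand-ins, not the actual implementation.

```python
# Hypothetical sketch of a multi-turn campaign environment; the oracle
# stands in for a physics-grounded checker such as a simulator.
from dataclasses import dataclass, field


@dataclass
class PhysicsOracle:
    ground_truth: float

    def score(self, claim: float) -> float:
        # Reward 1.0 when the final claim matches ground truth within
        # tolerance, 0.0 otherwise.
        return 1.0 if abs(claim - self.ground_truth) < 1e-2 else 0.0


@dataclass
class CampaignEnv:
    oracle: PhysicsOracle
    max_turns: int = 8
    history: list = field(default_factory=list)

    def step(self, action: dict) -> tuple[str, float, bool]:
        """Actions: search / escalate / revise / submit."""
        self.history.append(action)
        if action["type"] == "submit":
            # Reward arrives only at episode end: the policy must decide
            # when it trusts its evidence enough to commit.
            return "done", self.oracle.score(action["claim"]), True
        if len(self.history) >= self.max_turns:
            return "budget exhausted", 0.0, True
        # Intermediate turns return observations with no reward signal.
        return f"observation for {action['type']}", 0.0, False


env = CampaignEnv(oracle=PhysicsOracle(ground_truth=1.23))
obs, reward, done = env.step({"type": "search", "query": "band gap"})
obs, reward, done = env.step({"type": "submit", "claim": 1.23})
print(reward)  # 1.0
```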

Scaling test-time verification for novel materials

Probe-gradient guidance extracts a band-gap signal from an unconditional crystal diffusion model and steers sampling without retraining.
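
The steering pattern is classifier-guidance-style: a property probe scores the noisy sample, and its gradient nudges each reverse-diffusion step toward a target band gap. The sketch below assumes generic `denoiser` and `probe` callables; only the gradient-nudge pattern is the point, not any specific model.

```python
# Hypothetical sketch of probe-gradient guidance at sampling time.
import torch


def guided_step(denoiser, probe, x_t, t, target_gap, scale=1.0):
    """One reverse-diffusion step, steered toward a target band gap."""
    x_t = x_t.detach().requires_grad_(True)
    # The probe predicts band gap from the noisy sample; the
    # unconditional diffusion model itself is never retrained.
    pred_gap = probe(x_t, t)
    loss = ((pred_gap - target_gap) ** 2).sum()
    grad = torch.autograd.grad(loss, x_t)[0]
    with torch.no_grad():
        x_prev = denoiser(x_t, t)       # unconditional update
        x_prev = x_prev - scale * grad  # nudge toward target property
    return x_prev
```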

Self-improving pretraining

Ordinary pretraining chunks become continuation-choice data, interleaved-thought SFT examples, and RL mid-training environments.
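
For the continuation-choice case, one plausible construction is: split a chunk into a prefix and its true continuation, then draw distractor continuations from other chunks. The function below is an illustrative sketch, not the production pipeline.

```python
# Hypothetical sketch: turn raw pretraining chunks into
# continuation-choice examples (pick the true continuation of a
# prefix among distractors drawn from other chunks).
import random


def make_continuation_choice(chunks, idx, n_distractors=3, split=0.5):
    text = chunks[idx]
    cut = int(len(text) * split)
    prefix, true_cont = text[:cut], text[cut:]
    # Distractors come from the tails of other chunks.
    others = [i for i in range(len(chunks)) if i != idx]
    distractors = [
        chunks[j][int(len(chunks[j]) * split):]
        for j in random.sample(others, n_distractors)
    ]
    options = distractors + [true_cont]
    random.shuffle(options)
    return {
        "prefix": prefix,
        "options": options,
        "label": options.index(true_cont),
    }


chunks = ["The cat sat on the mat.", "Band gaps set optical onset.",
          "Gradient descent minimizes loss.", "Rewards shape behavior."]
print(make_continuation_choice(chunks, 0))
```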

Do language models know when to change their mind?

Signal-detection analysis measures whether models revise in response to valid critiques while resisting invalid ones, across model architectures, task domains, and critique-source pressure.
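
In signal-detection terms, revising after a valid critique is a hit and revising after an invalid critique is a false alarm, so discrimination can be summarized with d'. A small self-contained example (the counts are made up for illustration):

```python
# d' = z(hit rate) - z(false-alarm rate), with rates clamped away
# from 0 and 1 so the z-transform stays finite.
from statistics import NormalDist


def d_prime(hits, valid_n, false_alarms, invalid_n):
    h = min(max(hits / valid_n, 0.5 / valid_n), 1 - 0.5 / valid_n)
    fa = min(max(false_alarms / invalid_n, 0.5 / invalid_n),
             1 - 0.5 / invalid_n)
    z = NormalDist().inv_cdf
    return z(h) - z(fa)


# A model that revises on 80/100 valid critiques but also on
# 30/100 invalid ones:
print(round(d_prime(80, 100, 30, 100), 2))  # ~1.37
```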

Papers

RLVR / Code
RL-trained 120B KernelBench policy plus verifier-filtered selection under fixed inference budgets; the selection loop is sketched after this list.
Security agents / Calibration
A dual-control incident-response environment measures whether agents verify evidence before containment under adversarial context.
Agents / Continual learning
ATLAS frames agent improvement as online adaptation from trajectories rather than static fine-tuning alone.
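
The verifier-filtered selection mentioned above follows a best-of-n pattern: spend the fixed budget on sampling, keep candidates the verifier accepts, and return the top-scoring survivor. The sketch below uses toy stand-ins for the policy and verifier; names and the fallback choice are illustrative assumptions.

```python
# Hypothetical sketch of verifier-filtered selection under a fixed
# inference budget.
import random


def select(policy, verifier, prompt, budget=16):
    """Sample `budget` candidates, then pick the best one the
    verifier accepts (fallback: best raw score)."""
    candidates = [policy(prompt) for _ in range(budget)]
    scored = [(verifier(prompt, c), c) for c in candidates]
    accepted = [sc for sc in scored if sc[0] > 0.0]
    return max(accepted or scored, key=lambda sc: sc[0])[1]


# Toy stand-ins: a "policy" guessing integers, a "verifier" scoring
# closeness to a hidden target.
target = 7
policy = lambda p: random.randint(0, 10)
verifier = lambda p, c: 1.0 - abs(c - target) / 10
print(select(policy, verifier, "guess", budget=8))
```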

Production and OSS infrastructure

Model-serving optimization contributions in the open-source inference stack.
Multi-turn RL training cookbook contributions for agentic post-training recipes.
Evaluation harness for benchmarking continually learning agents.