Research

I work on systems that turn open-ended work into trainable environments, verifiers, rewards, and model-improvement loops.

Current threads

Scientific discovery

The Missing Layer in Autonomous Science

Verified campaign environments convert search, trust, escalation, and revision into a multi-turn RL problem with physics-grounded oracle reward. Trajectory-level RL on 247 environments lifts held-out hypothesis accuracy from 55.2% to 79.3% for a model with 3B active parameters, ahead of GPT-5.4 at 72.4%.

blog paper code model

Verification

Scaling test-time verification for novel materials

Probe-gradient guidance extracts band-gap signal from an unconditional crystal diffusion model and steers sampling without retraining. Guided by a probe with 0.957 AUROC, a model trained on 97.9% metals places 24-43% of its samples in the target band-gap window, comparable to conditional generation at over 50x lower sampling cost.

blog code model
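
The general idea of steering a sampler with a probe's gradient can be illustrated with a toy sketch. This is not the project's implementation: it assumes a linear logistic probe applied directly to a 2-D sample, a standard-normal score as a stand-in for the diffusion model, and noise-free updates. All names here (`probe_log_prob_grad`, `guided_step`, `guidance_scale`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe_log_prob_grad(x, w, b):
    """Gradient of log sigmoid(w.x + b) with respect to x (linear probe, assumed)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    return [(1.0 - p) * wi for wi in w]

def guided_step(x, base_score, w, b, step_size=0.1, guidance_scale=1.0):
    """One noise-free update along the base score plus the scaled probe gradient."""
    s = base_score(x)
    g = probe_log_prob_grad(x, w, b)
    return [xi + step_size * (si + guidance_scale * gi)
            for xi, si, gi in zip(x, s, g)]

def standard_normal_score(x):
    """Score of a standard normal; stands in for the unconditional model."""
    return [-xi for xi in x]

# Toy probe that only responds to the first coordinate.
w, b = [2.0, 0.0], 0.0

guided = [0.0, 0.0]
unguided = [0.0, 0.0]
for _ in range(200):
    guided = guided_step(guided, standard_normal_score, w, b)
    unguided = guided_step(unguided, standard_normal_score, w, b, guidance_scale=0.0)
```

In this toy setting the guided sample drifts along the direction the probe responds to, while the unguided one stays at the mode of the base distribution. The base model's parameters are never updated, which is the sense in which guidance works "without retraining".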

Post-training

Self-improving pretraining

Ordinary pretraining chunks become continuation-choice data, interleaved-thought SFT examples, and RL mid-training environments.

blog code model dataset

Interpretability

Do language models know when to change their mind?

A signal-detection analysis measures whether models revise in response to valid critique while resisting invalid critique, across architecture, domain, and source pressure.

blog code
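
As a rough illustration of the signal-detection framing, the standard sensitivity index d' can be computed from two counts: how often a model revises after a valid critique (hits) and how often it revises after an invalid one (false alarms). The sketch below assumes that mapping and a log-linear correction for extreme rates; the function names and the example counts are hypothetical and are not taken from the study.

```python
from statistics import NormalDist

def corrected_rate(events, trials):
    """Log-linear correction keeps the rate strictly between 0 and 1."""
    return (events + 0.5) / (trials + 1.0)

def d_prime(revised_valid, n_valid, revised_invalid, n_invalid):
    """Sensitivity: z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    hit = corrected_rate(revised_valid, n_valid)
    false_alarm = corrected_rate(revised_invalid, n_invalid)
    return z(hit) - z(false_alarm)

def criterion(revised_valid, n_valid, revised_invalid, n_invalid):
    """Response bias: positive values mean a general reluctance to revise."""
    z = NormalDist().inv_cdf
    hit = corrected_rate(revised_valid, n_valid)
    false_alarm = corrected_rate(revised_invalid, n_invalid)
    return -0.5 * (z(hit) + z(false_alarm))

# Hypothetical counts: revises on 80 of 100 valid critiques, 20 of 100 invalid ones.
sensitivity = d_prime(80, 100, 20, 100)
bias = criterion(80, 100, 20, 100)
```

Separating sensitivity from bias matters here because a model that revises after every critique and a model that never revises both score d' near zero, even though their raw revision rates differ completely.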

Papers

Scientific agents / Benchmark construction

Latent Mining for Verifiable Scientific Task Generation

A benchmark-construction method that turns disagreement between public scientific models into verifier-backed agent tasks. The first release is 165 regulatory-genomics triage tasks where 6 frontier agents reach an 11.3% aggregate strict solve rate and none exceed 15%.

blog paper dataset

Post-training / Trajectory data

Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces

An incentive-aligned agent arena becomes a post-training trajectory substrate. A structural-quality filter turns raw ShoppingBench subnet traces into a trainable corpus that lifts Qwen3-4B from 18.0% to 42.7% strict solve rate, training on a fraction of a single day of arena output.

paper

RLVR / Code

Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation

RL-trained 120B KernelBench policy plus verifier-filtered selection under fixed inference budgets.

paper code model

Security agents / Calibration

OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Dual-control incident-response benchmark measuring whether agents verify evidence before containment under adversarial context, with deterministic oracle scoring and published traces.

paper leaderboard code dataset

Agents / Continual learning

Continual Learning, Not Training: Online Adaptation For Agents

ATLAS modeled the advisor/executor pattern as a continual-learning system. A stronger teacher guided a lower-cost production agent, while reward-scored trajectories were stored as persistent learning memory and reused for inference-time adaptation, RL, and on-policy distillation.

paper technical report code docs reward system stateful eval distillation

Open Source

SGLang

Open-source inference contributions focused on model-serving performance and systems reliability.

Slime

Multi-turn RL training cookbook contributions for agentic post-training recipes.

CL-Bench

Evaluation harness for benchmarking continually learning agents.