Rethinking Evaluation for Agents That Never Stop Learning
Online evaluations turn production traces into verified frontier tasks with calibrated difficulty, while anchor sets keep progress comparable as agents improve.
Tiny Aya probes show how multilingual models route language into target-script, format, entity, and stopping behavior before decoding.
I adapted self-improving pretraining, interleaved-thought SFT, and RL mid-training to Qwen3-0.6B-Base.
A signal detection framework for LLM belief revision, applied across six open-weight models.
Surprisal-guided selection routes test-time compute toward high-value candidates, matching oracle selection with far fewer evaluated samples.
Frontier security agents detect real threats but over-trigger containment, exposing a calibration gap between detection and restraint.
Verifier-based sampling can beat a narrow RL gain when deployment can afford multiple attempts, changing the cost-benefit case for multi-turn training.
Becoming an expert means deliberately building the mental architecture to judge, question, and understand what models generate.
Browser agents need explicit consequence models that predict how actions change state, enabling planning, credit assignment, and transferable learning.
Production agents need shared learning loops where failures become reusable experience across the network instead of one-off human patches.
Institutional knowledge becomes dynamic when every diff, decision, and correction is searchable, reviewable, and available at the moment of use.