Where should test-time compute go? Surprisal-guided selection in verifiable environments
Given a capable model, how should you spend test-time compute? More training, more samples, or smarter selection?
LLMs achieve 94% precision on alert classification. That number looked promising until I gave four frontier models actual containment tools and watched them act on 82.5% of episodes they should have left alone.
Part 1 of a series on practical post-training pipelines for deployed agents.
This is a working note on research in progress. If you’re working on adaptive evaluation, continual learning, or tool-use agents, reach out at jbarnes850@gmail.com or on Twitter.