This is a working note on research in progress. If you’re working on adaptive evaluation, continual learning, or tool-use agents, reach out at jbarnes850@gmail.com or on Twitter.

“If we run the same evaluations on a continual basis, models might adapt and overfit to the evaluations. Especially if evaluations or feedback is implicit, the eval becomes part of the learning signal.”

Testing changes the learner. If you study how people learn, you see this quickly. Practice tests help students improve, and they also change what the test measures when you repeat it.

I think the same thing is happening with agents. When scores improve on a benchmark, I often cannot tell if the agent got better, or if it learned the benchmark.

This is the “eval becomes part of the learning signal” failure mode. Research on evolving benchmarks (like EvoEval) shows large performance drops and rank shifts when you transform coding benchmarks via LLM-driven perturbations. Systems that looked capable on static tests turned out to be brittle. They’d memorized patterns, not learned skills.

So the question is simple. How do we evaluate an agent that keeps changing?


Benchmarks as Environment Generators

Better sampling from a fixed item bank has a ceiling. Adaptive testing improves efficiency, but the item bank stays fixed, and agents can still adapt to it.

My proposal is to treat the benchmark as a task and environment generator. It is a closed-loop system that produces new, unseen, execution-verifiable challenges conditioned on the agent’s observed behavior.

The core mechanism works like this:

  1. Start with verifiable seeds. Real tasks with execution-based verification (tests that pass or fail, not vibes).

    As an example, I previously built an integration with RLVE (originally proposed for slime) that demonstrates this. RLVE provides 400+ math and logic environments with deterministic binary rewards. Problems are generated on the fly during training, not pre-generated, which enables curriculum learning without contamination.

  2. Generate derived challenge environments. Apply controlled transformations to create tasks that are related to the seeds but genuinely novel. Rename files, restructure directories, change configuration surfaces, add realistic distractors, alter constraints while preserving correctness. (A sketch of this generation step follows the list.)

  3. Condition generation on agent behavior. The generator targets the agent’s current frontier, the boundary between what it can and can’t do. Think of it as probing for failure modes in addition to scoring successes.

  4. Maintain longitudinal comparability. This is the hard part. If the benchmark keeps changing, how do you compare scores over time? The answer is psychometric linking: using anchor items and item response theory to place performance on a stable latent scale, even as the task bank evolves. Recent work on Fluid Benchmarking shows this is tractable for static models, achieving 50x efficiency gains on MMLU via IRT-based adaptive selection. The open question is whether linking holds when the agent itself is non-stationary. (A minimal linking sketch also follows the list.)
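To make steps 1 through 3 concrete, here is a minimal sketch of what a seed-plus-perturbation generator could look like. Everything in it is illustrative: the `Task` dataclass, the specific perturbations, and the frontier-matching heuristic are assumptions, not a description of an existing system. The only load-bearing idea is that the seed’s execution-based check is reused, so derived tasks stay verifiable.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable

# Hypothetical types: a seed task carries an execution-based check
# (pass/fail), plus a rough difficulty estimate used for frontier targeting.
@dataclass(frozen=True)
class Task:
    prompt: str
    workspace: dict[str, str]          # filename -> contents
    verify: Callable[[str], bool]      # runs the agent's answer, returns pass/fail
    difficulty: float                  # rough difficulty estimate in [0, 1]

def rename_files(task: Task, rng: random.Random) -> Task:
    """Semantics-preserving perturbation: shuffle surface details the agent
    may have memorized (filenames here; directories, configs, etc. in practice)."""
    mapping = {name: f"module_{i}.py" for i, name in enumerate(sorted(task.workspace))}
    workspace = {mapping[k]: v for k, v in task.workspace.items()}
    return replace(task, workspace=workspace)

def add_distractor(task: Task, rng: random.Random) -> Task:
    """Add an irrelevant but realistic file that must be ignored."""
    ws = dict(task.workspace)
    ws[f"unused_{rng.randint(0, 999)}.py"] = "# legacy helper, not imported anywhere\n"
    return replace(task, workspace=ws, difficulty=min(1.0, task.difficulty + 0.1))

PERTURBATIONS = [rename_files, add_distractor]

def generate_challenge(seeds: list[Task], frontier: float, rng: random.Random) -> Task:
    """Pick the seed closest to the agent's current frontier, then derive a
    novel-but-verifiable variant. The seed's verify() is reused, so the reward
    stays execution-based."""
    seed = min(seeds, key=lambda t: abs(t.difficulty - frontier))
    task = seed
    for perturb in rng.sample(PERTURBATIONS, k=len(PERTURBATIONS)):
        task = perturb(task, rng)
    return task
```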
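For the linking step (item 4), here is a minimal sketch using a two-parameter logistic (2PL) IRT model. It assumes the anchor items’ discrimination and difficulty parameters were calibrated ahead of time and held fixed across benchmark versions; the numbers below are made up for illustration.

```python
import numpy as np

def p_correct(theta: float, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """2PL item response function: probability of success on each item for
    an agent with latent ability theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses: np.ndarray, a: np.ndarray, b: np.ndarray) -> float:
    """Maximum-likelihood ability estimate via a simple grid search.
    `responses` are 0/1 outcomes on the anchor items, whose (a, b) parameters
    are held fixed across benchmark versions: that fixing is what links scores
    from an evolving task bank onto one latent scale."""
    grid = np.linspace(-4.0, 4.0, 801)
    log_liks = [
        np.sum(responses * np.log(p_correct(t, a, b) + 1e-9)
               + (1 - responses) * np.log(1 - p_correct(t, a, b) + 1e-9))
        for t in grid
    ]
    return float(grid[int(np.argmax(log_liks))])

# Illustrative anchor items (pre-calibrated discrimination a and difficulty b).
a = np.array([1.2, 0.8, 1.5, 1.0])
b = np.array([-0.5, 0.0, 0.7, 1.3])

# Two evaluation sessions on different derived task banks, but the same anchors.
session_1 = np.array([1, 1, 0, 0])
session_2 = np.array([1, 1, 1, 0])
print(estimate_ability(session_1, a, b), estimate_ability(session_2, a, b))
```

Holding the anchor parameters fixed is what places sessions run on different derived task banks onto the same scale; whether those parameters stay valid while the agent itself is learning between sessions is exactly the open question above.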

The result is a benchmark that behaves like an adaptive environment designer.


The Teacher/Solver/Generator Loop

If you’ve read my earlier posts on world models, this structure will feel familiar. It is a closed-loop system where the evaluation learns about the agent, and the agent learns from the environment.

A common setup is a co-evolution framework (a minimal sketch of the loop follows the list):

  • Teacher: Diagnoses failure modes and specifies what to generate next. Which axes to shift, what to hold fixed, what difficulty region to target.
  • Generator: Proposes derived environments from seeds, constrained by verifiability, safety, and novelty requirements.
  • Solver: The evaluated agent. It doesn’t train during evaluation. It just responds to the challenges.
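Here is a minimal sketch of how the three roles could fit together, reusing the hypothetical `Task` type from the earlier sketch. The `Protocol` interfaces and the session loop are illustrative assumptions; the point is that only the evaluation side adapts inside the loop, while the solver stays frozen.

```python
from typing import Protocol

class Teacher(Protocol):
    def diagnose(self, history: list[dict]) -> dict:
        """Summarize failure modes into a generation spec: which axes to shift,
        what to hold fixed, what difficulty region to target."""

class Generator(Protocol):
    def propose(self, spec: dict) -> "Task":
        """Derive a verifiable, novel environment from seeds under the spec."""

class Solver(Protocol):
    def attempt(self, task: "Task") -> str:
        """The evaluated agent. Frozen during evaluation; it only responds."""

def evaluation_session(teacher: Teacher, generator: Generator, solver: Solver,
                       n_tasks: int = 20) -> list[dict]:
    """One closed-loop session: the evaluation learns about the agent,
    the agent just responds. Returns per-task records for later scoring."""
    history: list[dict] = []
    for _ in range(n_tasks):
        spec = teacher.diagnose(history)      # e.g. {"target_difficulty": 0.6, ...}
        task = generator.propose(spec)
        answer = solver.attempt(task)
        history.append({
            "difficulty": task.difficulty,
            "passed": task.verify(answer),
        })
    return history
```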

This is inspired by work like Socratic-Zero and Dr. Zero, where similar loops are used for curriculum generation during training.

The outputs shift from “score on benchmark X” to three questions: Where does the agent’s success-to-failure boundary sit as difficulty increases? How fast does that boundary move with experience? And do improvements on seen items transfer to fresh challenges? (Learning velocity is measured across evaluation sessions, not within a single run; the eval itself is a frozen snapshot.)
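One illustrative way to turn per-task records (like the `history` produced in the loop sketch above) into those three outputs. The specific estimators are assumptions, not settled metrics: a smoothed pass-rate crossing for the boundary, a regression slope across sessions for velocity, and a seen-versus-fresh gap for transfer.

```python
import numpy as np

def frontier(history: list[dict], target: float = 0.5) -> float:
    """Difficulty at which the pass rate crosses `target`, i.e. the agent's
    success-to-failure boundary, estimated by a crude smoothed crossing."""
    pts = sorted(history, key=lambda r: r["difficulty"])
    diffs = np.array([r["difficulty"] for r in pts])
    passes = np.array([float(r["passed"]) for r in pts])
    window = max(1, len(pts) // 5)
    smooth = np.convolve(passes, np.ones(window) / window, mode="same")
    below = np.where(smooth < target)[0]
    return float(diffs[below[0]]) if len(below) else float(diffs[-1])

def learning_velocity(session_frontiers: list[float]) -> float:
    """How fast the boundary moves across evaluation sessions
    (slope of frontier vs. session index)."""
    x = np.arange(len(session_frontiers))
    return float(np.polyfit(x, np.array(session_frontiers), 1)[0])

def transfer_gap(seen_pass_rate: float, fresh_pass_rate: float) -> float:
    """Do improvements on seen items carry over to fresh challenges?
    Zero means full transfer; a large positive gap suggests memorization."""
    return seen_pass_rate - fresh_pass_rate
```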

Agents are already doing forms of continual learning. Test-time training updates weights during inference. Context compression accumulates task-relevant state. Harness design and tool selection carry learning across sessions. Long-horizon agents are increasingly capable of reflecting on their own learning and previous trajectories to improve (thank you, Ralph Wiggum).

The hypothesis I’m testing is that adaptive evaluations reveal things static benchmarks miss. If that is true, it changes how we train and how we track progress.

Static benchmarks give you a number. Deployed agents give you a trajectory. I care about the rate of learning once an agent is deployed and starts to pick up the details of my environment and preferences.

There’s a passage from Kirschner, Hendrick, and Heal’s Instructional Illusions that I keep returning to:

“The person who arrives at the right answer instantly is not necessarily blessed with superior processing power… it may be rather that they have built, through accumulated experience and knowledge, the kind of interconnected mental architecture that allows rapid retrieval. Networks are built. The quick, correct intuition is not evidence of a gifted mind operating mysteriously; it is evidence of a well stocked and well organised one.”

The more context we provide, the more intuitive the behavior. The next problem is measuring that intuition without contaminating it.