Self-Improving Pretraining as a Substrate for Agentic Post-Training
Release artifacts
Code, model weights, and the generated interleaved-thinking dataset for this experiment.
Reference Terms
- Self-Improving Pretraining uses a stronger post-trained model to supervise pretraining of the current policy. The data stream is split into a prefix and a suffix (the last N tokens). The post-trained model can rewrite the suffix into a higher-quality target or act as a judge over the original suffix, the rewrite, and K rollouts from the current policy. Reinforcement learning pretrains the policy using the judge’s reward signal.
- Online DPO (direct preference optimization) is the RL objective Tan et al. use during self-improving pretraining. For each prefix, the judge scores a candidate set (original suffix, rewrite, K rollouts) and an update favors the highest-scoring candidate as chosen and the lowest as rejected. Unlike GRPO, DPO is off-policy, so it can learn from candidates the current policy did not produce.
- Interleaved thoughts are short reasoning spans inserted by a teacher model into ordinary pretraining chunks at semantically appropriate positions. The augmented chunk alternates segments of original text with inserted thoughts, while the original content still concatenates back to the unaltered chunk. The student first learns the format through SFT and is later rewarded for thoughts that help predict the next chunk.
- RLMT (reinforcement learning mid-training) is the second mid-training phase. Each augmented pretraining chunk is split into a prefix and a ground-truth suffix. From the prefix, the model generates a thought followed by a predicted suffix. An LLM judge compares the predicted suffix to the ground-truth suffix and emits a binary reward, and the RL update maximizes that reward. The reward is on the thought-conditioned suffix prediction, not on the thought itself.
- Reward gate is the held-out evaluation that scores the same object RLMT trains on. The model generates a thought, predicts the suffix, and the judge scores the predicted suffix against the true suffix on items the model never saw during training.
Shaping Model Behavior
Modern model training is a lifecycle. A base model is pretrained, adapted through mid-training, extended for context length, taught thinking formats, preference-tuned, and then optimized with verifiable rewards.
Agentic post-training asks that model to reason through uncertainty, plan over longer horizons, use tools, revise its beliefs, and recover from mistakes. These behaviors are usually optimized late in the lifecycle, and recent work on reinforcement learning with verifiable rewards (RLVR) argues that late optimization often improves sampling efficiency more than it creates fundamentally new reasoning patterns.
Can more of that behavior be shaped earlier in the lifecycle, particularly during pretraining, so the model arrives at agentic post-training already prepared for it?
Rephrasing a pretraining corpus with synthetic data is not new. Synthetic bootstrapped pretraining pushes that idea further by learning relations between documents from the corpus itself, then using those relations to synthesize new pretraining data. The learning problem around the corpus also changes. Instead of treating pretraining as passive exposure to text, the training loop can ask whether the model can generate a better continuation, whether it can insert local reasoning into the text, and whether those thoughts help downstream prediction.
Synthetic pretraining and active data design are the closest neighboring lines. Hugging Face’s Synthetic Data Playbook is a practical guide to prompt design, data mixing, small generators, and train-and-evaluate loops. Tufa Labs reports a benchmark-positive version of very-small-model synthetic pretraining. The Berkeley Sky Computing Lab and NVIDIA lecture treats it as active curriculum design. Vintage Data argues that synthetic pretraining is data design moving earlier in model development. Large-scale studies of synthetic data in LLM pre-training also show that the benefits are conditional. Rephrased synthetic data can help in mixtures, while pure generated textbook-style data can hurt downstream generalization.
This experiment tested whether a small base model could be shaped earlier in the lifecycle, before agentic post-training, using the same kind of pretraining chunks it would normally only imitate.
From Pretraining Text to Training Environment
Tan et al. propose two linked training ideas. The first is self-improving pretraining, which uses stronger post-trained models to improve earlier stages of the training pipeline. The second is thinking mid-training, Tan et al.’s term for the mid-training arc that adds interleaved reasoning to pretraining text. SFT teaches the thought format, and RLMT then rewards thoughts that help predict the next chunk.
Self-improving pretraining changes what the model is asked to learn from a pretraining chunk. In standard next-token pretraining, the model sees a prefix and learns to imitate the suffix that happened to come next in the corpus. Tan et al. instead use a stronger post-trained model in two ways. It rewrites the suffix into a higher-quality target, and it acts as a judge over the original suffix, the rewrite, and K rollouts from the current policy. The update favors the continuation judged best, so the signal of interest is not lower loss on the original text but whether the model’s own rollouts start becoming good enough to beat the data suffix.
Thinking mid-training works one stage later. Mid-training is the stage between base pretraining and final post-training, where labs often adapt a model toward longer context, reasoning formats, domain mixtures, or other broad capabilities before task-specific alignment. Tan et al. use this stage to teach a model to interleave short thoughts inside ordinary text. A related line of work, reasoning continued pretraining, synthesizes hidden thoughts from ordinary text, treating the observed document as the product of an implicit reasoning process. In Tan et al.’s version, a teacher model inserts local reasoning spans into pretraining chunks, the student learns that format through supervised fine-tuning, and then RL mid-training (RLMT) rewards the student when its generated thought helps predict the held-out continuation.
The same corpus then becomes multiple training environments. Raw chunks become self-improving continuation preference data, then interleaved-thinking data, then supervised thought-interface training, then RLMT.
- Self-improving continued pretraining: the model generates K alternative continuations. A judge ranks them against the original suffix, and Online DPO favors the highest-scoring candidate.
- Interleaved-thinking SFT: a teacher model inserts short thoughts at semantically appropriate positions. The student learns the format through SFT, where the original text still concatenates back to the unaltered chunk.
- RL mid-training: the model generates a thought, then a predicted suffix. The judge scores the predicted suffix against the true continuation, and reward depends on whether the suffix is good enough.
A Small-Scale Lifecycle Experiment
I ran the adaptation on Qwen3-0.6B-Base to test whether the lifecycle shape was trainable at small scale.
The data came from FineWeb-Edu, materialized into ordinary pretraining chunks of prefix and true suffix. That structure stayed constant across the experiment. What changed was the learning problem wrapped around it.
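The chunk structure can be sketched in a few lines. This is a minimal illustration, assuming chunks are already tokenized into lists of token ids and that the suffix is the last `n_suffix` tokens. The names `Chunk`, `split_chunk`, and `n_suffix` are hypothetical and not taken from the experiment's code.

```python
from typing import NamedTuple, Sequence


class Chunk(NamedTuple):
    """One pretraining example: a prefix and the true suffix that followed it."""

    prefix: list[int]
    suffix: list[int]


def split_chunk(token_ids: Sequence[int], n_suffix: int) -> Chunk:
    """Split a tokenized chunk into a prefix and its last n_suffix tokens."""
    if not 0 < n_suffix < len(token_ids):
        raise ValueError("n_suffix must leave a non-empty prefix and suffix")
    cut = len(token_ids) - n_suffix
    return Chunk(prefix=list(token_ids[:cut]), suffix=list(token_ids[cut:]))


# Toy token ids for illustration only.
chunk = split_chunk(list(range(10)), n_suffix=3)
```

Every later stage reuses this same pair. Only the objective wrapped around `prefix` and `suffix` changes.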
The first stage was self-improving continued pretraining with Online DPO. For each prefix, the model generated 16 candidate continuations. A larger Qwen judge compared the original suffix and the policy rollouts, using full-pairwise comparisons rather than a single pivot comparison. The update pushed the model toward the judged-better continuation and away from the judged-worse one.
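The difference between full-pairwise and single-pivot comparison can be sketched as follows. This is an illustration under stated assumptions, not the experiment's code: `judge_prefers` is a hypothetical stand-in for the judge call, and ranking candidates by total pairwise wins is one plausible aggregation rule that the write-up does not specify.

```python
from itertools import combinations
from typing import Callable, Sequence


def rank_full_pairwise(
    candidates: Sequence[str],
    judge_prefers: Callable[[str, str], bool],
) -> tuple[int, int]:
    """Return (chosen_index, rejected_index) from wins over all candidate pairs.

    judge_prefers(a, b) is a hypothetical judge call that returns True when
    continuation a is judged better than continuation b.
    """
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if judge_prefers(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    chosen = max(range(len(candidates)), key=wins.__getitem__)
    rejected = min(range(len(candidates)), key=wins.__getitem__)
    return chosen, rejected


# Toy judge for illustration only: it prefers the longer continuation.
toy_candidates = ["original suffix", "short", "a much longer policy rollout"]
chosen_idx, rejected_idx = rank_full_pairwise(
    toy_candidates, lambda a, b: len(a) > len(b)
)
```

If the candidate set is the original suffix plus 16 rollouts, full-pairwise judging costs 136 comparisons per prefix, where a single-pivot scheme would cost 16. The extra judge calls buy a ranking that does not depend on which candidate was picked as the pivot.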
The second stage turned the same kind of pretraining chunks into interleaved-thinking examples. A teacher model inserted short local thoughts into the text while preserving the original content, reusing the teacher’s rewrite role from self-improving pretraining but relocated into mid-training. The student was then supervised on the augmented sequence, including both the original text and the inserted thoughts. This stage was not meant to prove reasoning improvement by itself. It was meant to install the interface: where thoughts appear, what they look like, and how they connect to nearby text.
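The preservation property can be checked mechanically. The sketch below assumes thoughts are wrapped in `<thought>...</thought>` tags. That delimiter is a placeholder, since the write-up does not state the actual format, and the function names are hypothetical.

```python
import re

# Assumed delimiter format; the write-up does not state the actual tags.
THOUGHT_PATTERN = re.compile(r"<thought>.*?</thought>", flags=re.DOTALL)


def strip_thoughts(augmented: str) -> str:
    """Remove inserted thought spans, leaving only the original text segments."""
    return THOUGHT_PATTERN.sub("", augmented)


def preserves_original(original: str, augmented: str) -> bool:
    """True when the non-thought segments concatenate back to the original chunk."""
    return strip_thoughts(augmented) == original


# Toy example for illustration only.
original = "Water boils at 100 C at sea level. At altitude it boils sooner."
augmented = (
    "Water boils at 100 C at sea level."
    "<thought>Lower pressure should lower the boiling point.</thought>"
    " At altitude it boils sooner."
)
thought_count = len(THOUGHT_PATTERN.findall(augmented))
```

A check like this is a natural filter for teacher output: any row where the teacher paraphrased or dropped original text fails `preserves_original` and can be rejected before SFT.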
The third stage was RL mid-training. The model received a prefix, generated a thought, then predicted the next suffix. A judge scored the predicted suffix against the real suffix. The reward did not directly grade the thought. It only asked whether the thought-conditioned continuation was good enough.
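The shape of that reward can be sketched as a single function. This is a minimal illustration, assuming a binary judge as described in the reference terms. The three callables are hypothetical stand-ins for the policy and the LLM judge.

```python
from typing import Callable


def rlmt_reward(
    prefix: str,
    true_suffix: str,
    generate_thought: Callable[[str], str],
    generate_suffix: Callable[[str, str], str],
    judge_accepts: Callable[[str, str], bool],
) -> float:
    """Binary reward on the thought-conditioned suffix, never on the thought."""
    thought = generate_thought(prefix)
    predicted_suffix = generate_suffix(prefix, thought)
    # The judge sees only the two suffixes; the thought is not passed to it.
    return 1.0 if judge_accepts(predicted_suffix, true_suffix) else 0.0


# Toy stand-ins so the sketch runs; a real setup would call the policy and judge.
reward = rlmt_reward(
    prefix="The tide rises because",
    true_suffix="the moon pulls on the ocean.",
    generate_thought=lambda prefix: "Think about gravity from the moon.",
    generate_suffix=lambda prefix, thought: "the moon pulls on the ocean.",
    judge_accepts=lambda predicted, truth: predicted == truth,
)
```

Because the thought never reaches the judge, the only way a thought can earn reward is by changing the suffix the policy goes on to produce.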
To compare, I trained two lineages. One started from the original Qwen3-0.6B-Base, and the other started from the self-improved checkpoint. Both were trained on the same interleaved-thinking data, and both then went through the same RLMT setup.
Because the interleaved-thinking data and the RLMT objective are identical across lineages, the comparison isolates whether self-improving continued pretraining leaves the model better prepared for later thinking mid-training.
| Stage | Input | Learning problem | Output |
|---|---|---|---|
| Self-improving continued pretraining | FineWeb-Edu prefix and suffix chunks | Choose between the original suffix and K=16 policy rollouts using full-pairwise judge comparisons and Online DPO | A self-improved base checkpoint |
| Interleaved-thinking SFT | The same kind of raw chunks, augmented by a teacher | Learn to reproduce original text with short local thoughts inserted | A model that can use the thought interface |
| RL mid-training | Prefixes from the augmented corpus | Generate a thought, predict the suffix, and receive reward from suffix quality | A model optimized through a sparse thought-conditioned reward |
| Evaluation | Held-out prefixes and reasoning tasks | Separate continuation quality, reward-object performance, downstream reasoning, and causal thought use | Evidence about where the lifecycle worked and where it remained immature |
Results
Continued pretraining improved judged continuation quality
On held-out prefix-suffix chunks, the self-improved checkpoint beat Qwen3-0.6B-Base on 81 of 128 pairwise continuation judgments, a 63.28% win rate.
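As a rough check on how far 81 of 128 sits from a coin flip, the sketch below computes the win rate and a Wilson score interval. It treats the 128 judgments as independent trials, which is an assumption: shared prefixes or judge biases would make the true uncertainty wider. The interval is not a figure reported by the experiment.

```python
import math


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


win_rate = 81 / 128
low, high = wilson_interval(81, 128)
print(f"win rate {win_rate:.4f}, interval ({low:.3f}, {high:.3f})")
```

Under the independence assumption, the lower end of the interval sits above 0.5, so the win rate is unlikely to be a coin flip, though the interval is wide at this sample size.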
Training dynamics moved in the same direction. The rate at which the judge selected a policy rollout rather than the original suffix rose from 62.5% to 74.6%, and the DPO margin rose to 0.19. (DPO margin is the average log-probability gap between chosen and rejected continuations across training pairs. A larger margin means the policy more strongly prefers the judged-better continuation.)
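The margin described here, together with the standard DPO loss it usually accompanies, can be sketched in plain Python. This assumes sequence-level log-probabilities are already computed. The `beta` value and the use of a frozen reference model follow the standard DPO formulation and are assumptions about this experiment's setup, not details taken from it.

```python
import math
from typing import Sequence


def dpo_margin(chosen_logps: Sequence[float], rejected_logps: Sequence[float]) -> float:
    """Average log-probability gap between chosen and rejected continuations."""
    gaps = [c - r for c, r in zip(chosen_logps, rejected_logps)]
    return sum(gaps) / len(gaps)


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the experiment
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * (policy gap - ref gap))."""
    logits = beta * ((policy_chosen - policy_rejected) - (ref_chosen - ref_rejected))
    # -log(sigmoid(x)) equals softplus(-x), written to stay stable for large |x|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Toy log-probabilities for illustration only.
margin = dpo_margin([-1.0, -2.0], [-1.5, -2.0])
loss_at_tie = dpo_loss(-1.0, -1.0, -1.0, -1.0)
```

The loss falls as the policy widens the chosen-versus-rejected gap relative to the reference, which is why a rising margin is read as the policy moving toward the judged-better continuation.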
Raw suffix NLL moved the other way, from 2.56 to 2.60. Online DPO was not optimizing exact imitation of the original suffix. It was optimizing judged continuation quality.
SFT installed the thought interface
The final interleaved-thinking dataset had 8,704 rows, with 8,192 train rows and 512 validation rows. It averaged 4.39 thoughts per row, had zero malformed rows, and preserved the raw text closely, with average raw word coverage of 99.98%.
SFT over this dataset changed the model’s ability to model thought tokens. Thought NLL dropped from 4.24 for the raw baseline to 3.16 for the base model trained on interleaved thoughts and 3.14 for the self-improved model trained on interleaved thoughts. (Thought NLL is the negative log-likelihood the model assigns to the inserted-thought tokens specifically, measured separately from the surrounding original text. Lower means the model produces thoughts in the trained format more fluently.)
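Thought NLL amounts to a masked average over per-token log-probabilities. The sketch below assumes those log-probabilities and a boolean thought mask are already available for each sequence. The function name and the toy numbers are illustrative.

```python
from typing import Sequence


def masked_nll(token_logprobs: Sequence[float], is_thought: Sequence[bool]) -> float:
    """Mean negative log-likelihood over thought tokens only."""
    selected = [-lp for lp, flag in zip(token_logprobs, is_thought) if flag]
    if not selected:
        raise ValueError("sequence contains no thought tokens")
    return sum(selected) / len(selected)


# Toy values for illustration only: two text tokens around two thought tokens.
logprobs = [-1.0, -3.0, -5.0, -2.0]
thought_mask = [False, True, True, False]
thought_nll = masked_nll(logprobs, thought_mask)
text_nll = masked_nll(logprobs, [not flag for flag in thought_mask])
```

Reporting the two averages separately keeps a fluent thought format from being hidden by, or mistaken for, a change in how well the model predicts the original text.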
SFT installed the interface. It did not by itself answer whether thoughts were useful. The pre-RLMT comparison between the self-improved thinking model and the base thinking model was exactly tied at 64-64, which made RLMT the next test.
RLMT made the interface rewardable
The post-RLMT reward gate evaluated the same object RLMT trained on, a thought followed by a predicted suffix, scored against the true suffix. The base thinking model had a reward mean of 0.088. The self-improved thinking model reached 0.091. After RLMT, the base lineage reached 0.094 and the self-improved lineage reached 0.098, the highest of the four.
The downstream reasoning eval was mixed. I evaluated GSM8K, MATH-500, GPQA-Diamond, and OlympiadBench with eight samples per problem, using two metrics. Mean@8 is the average correctness rate across 8 samples per problem, so it asks how often an average sample is correct. Pass@8 counts a problem as solved if any of 8 samples is correct, so it asks whether the model can solve the problem within an 8-sample budget. On macro Mean@8, the self-improved thinking model was strongest before RLMT at 0.175, while the self-improved RLMT model fell to 0.166. On macro Pass@8, the self-improved RLMT model was highest at 0.512.
Mean@8 and Pass@8 split here, and at this scale the reward-object result was cleaner than the downstream benchmark result.
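The two metrics can diverge because they summarize the same correctness matrix differently. A minimal sketch with toy data, where each row is one problem and each entry is one sample (the function names are illustrative):

```python
from typing import Sequence


def mean_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Average per-problem fraction of correct samples."""
    return sum(sum(row) / len(row) for row in correct) / len(correct)


def pass_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems where at least one sample is correct."""
    return sum(any(row) for row in correct) / len(correct)


# Toy data for illustration only: two problems, eight samples each.
# Model X is reliable on one problem and never solves the other.
model_x = [[True] * 4 + [False] * 4, [False] * 8]
# Model Y solves both problems, but only once in eight samples.
model_y = [[True] + [False] * 7, [True] + [False] * 7]

x_mean, x_pass = mean_at_k(model_x), pass_at_k(model_x)
y_mean, y_pass = mean_at_k(model_y), pass_at_k(model_y)
```

Here model X has the higher Mean@8 and model Y the higher Pass@8, which is the same kind of split seen between the pre-RLMT and post-RLMT self-improved models. The macro versions average each metric across the four benchmarks.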
Thought text controlled continuation behavior
To move beyond aggregate reward, I ran a causal thought-use probe. The probe replaces the model’s generated thought before suffix generation, then measures whether suffix reward changes. This tests whether interleaved thoughts are just formatting or whether they actually steer continuation behavior.
Swapping in an unrelated thought usually reduced suffix reward. Thought text was not decorative. The thought channel had become a real handle on the model’s continuation policy, not just a tag format learned during SFT.
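The swap intervention can be sketched as below. This is an illustration under stated assumptions: `suffix_reward` is a hypothetical wrapper around suffix generation plus judging, and rotating thoughts by one position is one simple way to pair each prefix with an unrelated thought, not necessarily the pairing used in the experiment.

```python
from typing import Callable, Sequence


def thought_swap_effect(
    prefixes: Sequence[str],
    own_thoughts: Sequence[str],
    suffix_reward: Callable[[str, str], float],
) -> tuple[float, float]:
    """Mean suffix reward with each prefix's own thought vs. a swapped one.

    suffix_reward(prefix, thought) is a hypothetical wrapper that generates a
    suffix conditioned on the prefix and thought, then returns the judge reward.
    The swap rotates thoughts by one position, so each prefix receives the
    thought generated for a different prefix.
    """
    if len(prefixes) < 2 or len(prefixes) != len(own_thoughts):
        raise ValueError("need at least two prefixes, one thought per prefix")
    swapped = list(own_thoughts[1:]) + list(own_thoughts[:1])
    n = len(prefixes)
    own = sum(suffix_reward(p, t) for p, t in zip(prefixes, own_thoughts)) / n
    swap = sum(suffix_reward(p, t) for p, t in zip(prefixes, swapped)) / n
    return own, swap


# Toy reward for illustration only: 1.0 when the thought mentions the prefix topic.
toy_prefixes = ["tides", "volcano", "enzyme"]
toy_thoughts = [
    "tides follow the moon",
    "volcano magma rises",
    "enzyme lowers activation energy",
]
own_mean, swapped_mean = thought_swap_effect(
    toy_prefixes, toy_thoughts, lambda p, t: 1.0 if p in t else 0.0
)
```

The prefix and the policy stay fixed across both conditions, so any gap between the two means is attributable to the thought text alone.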
I then ran a follow-up diagnostic to determine if the weak sampled-thought result came from a useless thought channel, weak teacher thoughts, or a small model that could not reliably generate useful thoughts for a real channel.
For each held-out prefix, I compared the model’s own sampled thought against three counterfactuals: the original teacher-inserted thought from the dataset, a blank or generic control, and a swapped unrelated thought. I also sampled 16 model thoughts per prefix and measured an oracle upper bound: the fraction of prefixes where any sampled thought led to a rewarded suffix.
| Arm | Own thought | Teacher thought | Max(blank, generic) | Swapped thought | Oracle@16 |
|---|---|---|---|---|---|
| Base + thinking SFT | 0.023 | 0.039 | 0.094 | 0.008 | 0.250 |
| Self-improved + thinking SFT | 0.016 | 0.055 | 0.039 | 0.000 | 0.219 |
| Base + thinking SFT + RLMT | 0.016 | 0.117 | 0.063 | 0.016 | 0.156 |
| Self-improved + thinking SFT + RLMT | 0.039 | 0.070 | 0.055 | 0.008 | 0.375 |
Own, teacher, control, and swapped columns report suffix-reward means. Max(blank, generic) is the better no-content control for that arm. Oracle@16 is the fraction of prefixes where at least one of 16 sampled model thoughts produced a rewarded suffix.
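Oracle@K and the expected reward of a single sampled thought come from the same per-prefix reward matrix. A minimal sketch with toy data, where each row holds the binary suffix rewards for the thoughts sampled on one prefix (four samples here for brevity, and the function names are illustrative):

```python
from typing import Sequence


def oracle_at_k(rewards: Sequence[Sequence[float]]) -> float:
    """Fraction of prefixes where at least one sampled thought earned reward."""
    return sum(any(r > 0 for r in row) for row in rewards) / len(rewards)


def expected_single_sample_reward(rewards: Sequence[Sequence[float]]) -> float:
    """Mean reward when one sampled thought is picked uniformly at random."""
    return sum(sum(row) / len(row) for row in rewards) / len(rewards)


# Toy rewards for illustration only: four prefixes, four sampled thoughts each.
toy_rewards = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
oracle = oracle_at_k(toy_rewards)
single = expected_single_sample_reward(toy_rewards)
```

The gap between the two numbers is the headroom that a perfect thought selector could recover in principle.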
This narrowed the failure mode. Swapped unrelated thoughts usually collapsed reward. The teacher thoughts were not merely decorative. They beat the model’s own sampled thoughts in every arm and beat the better no-content control in three of four arms, most clearly in the base RLMT lineage, where teacher thoughts reached 0.117 reward compared with 0.016 for the model’s own sampled thoughts. But blank and generic controls still beat the model’s own thoughts in every arm. The model had learned a thought-conditioned interface before it learned to reliably author useful thoughts for that interface.
The oracle@16 result makes the failure sharper. Useful thoughts existed in the model’s sampling distribution, especially for the self-improved RLMT lineage, where at least one of 16 sampled thoughts succeeded on 37.5% of prefixes. But that is an upper bound, not a policy. A deployed model samples one thought, not 16 thoughts with access to an oracle judge.
I also tested whether simple selection could recover the headroom. It could not. A format heuristic and a prefix-only LLM selector both failed. The selector saw only the prefix and candidate thoughts, not the reference suffix, generated suffix, or judge score. Across arms, the prefix-only selector reached 0.008 macro reward, below the random expected reward of 0.021 and far below the macro oracle@16 upper bound of 0.250.
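The macro oracle@16 figure follows directly from the table, and the selector result can be placed on the same scale. In the sketch below, `headroom_recovered` is an illustrative normalization introduced here, not a metric reported by the experiment.

```python
# Oracle@16 per arm, copied from the table above.
oracle_at_16 = {
    "base_sft": 0.250,
    "self_improved_sft": 0.219,
    "base_rlmt": 0.156,
    "self_improved_rlmt": 0.375,
}
macro_oracle = sum(oracle_at_16.values()) / len(oracle_at_16)

# Macro rewards reported in the text.
selector_reward = 0.008
random_reward = 0.021


def headroom_recovered(selected: float, baseline: float, ceiling: float) -> float:
    """Share of the baseline-to-ceiling gap that a selector closes (illustrative)."""
    return (selected - baseline) / (ceiling - baseline)


recovered = headroom_recovered(selector_reward, random_reward, macro_oracle)
```

A negative value means the selector did worse than picking a sampled thought at random, which matches the reading above: the prefix-only selector recovered none of the oracle headroom.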
The unresolved problem is policy-useful thought generation. The winning thoughts were not simply the ones that looked cleaner, more coherent, or more locally relevant to another model reading the prefix; their usefulness depended on how this 0.6B policy used the thought during suffix generation. The positive result is causal sensitivity: changing the thought text changes the model’s later continuation behavior under the same prefix. Here, unrelated thoughts sharply reduced suffix reward, so the thought text was not just formatting. The failure is unreliable self-generation.
| Stage | Primary evidence | Interpretation |
|---|---|---|
| Continued pretraining | 81/128 held-out pairwise continuation wins over Qwen3-0.6B-Base | The self-improved checkpoint produced better judged continuations |
| Interleaved-thinking SFT | Thought NLL dropped from 4.24 to 3.16 and 3.14 | SFT installed the thought interface |
| RLMT reward gate | Self-improved RLMT reached the highest reward mean at 0.098 | RLMT made the interface rewardable under the suffix-prediction objective |
| Reasoning eval | GSM8K, MATH-500, GPQA-Diamond, and OlympiadBench moved differently across Mean@8 and Pass@8 | Downstream reasoning did not improve uniformly |
| Thought-use probe | Swapped thoughts reduced reward, while teacher thoughts and oracle@16 exposed usable headroom | Thought text changed continuation behavior, but the policy did not reliably author useful thoughts |
Implications for Training
Small language models are usually framed as vertical tools. They work when the task is narrow, the domain is bounded, and the behavior is easy to specify.
A small language model can be specialized not only for a domain but also for the behaviors later training will need to refine.
Tan et al. argue that reasoning should be trained earlier in the lifecycle because raw pretraining text does not expose the intermediate reasoning behavior later post-training has to optimize. I read this experiment as a small-scale version of that idea. The model learned a continuation preference, a thought interface, and a rewardable suffix-prediction loop. The thought channel then became behaviorally causal.
The question is whether early training can make a small model easier to turn into a reliable agent later, not only whether it can solve a vertical task.
Self-improving pretraining is most valuable where data is scarce, expensive, private, long-tail, or behaviorally underspecified. Synthetic continued pretraining is explicitly motivated by facts appearing rarely in small domain corpora, and improves learning by rearranging knowledge into more learnable forms. Privacy-constrained continual pretraining is also emerging, with encrypted synthetic data pipelines for sensitive-domain adaptation.
Smaller domain models often need reliable behavioral interfaces before post-training. Scientific agents, code agents, compliance analysts, medical-document assistants, internal enterprise copilots, and multilingual or domain-specific models all fit this shape. In those settings, the valuable synthetic target is structured behavior, a model that can plan, inspect, cite, verify, revise, call tools, recover from failed assumptions, and predict the next domain-relevant state.
A practical recipe is to take a domain corpus, preserve the original text, synthesize interleaved local reasoning or decision traces, train the model to use that interface, and then reward the interface only through downstream observable outcomes.
For scientific and engineering agents, the relevant behaviors are often not captured by today’s benchmarks. Planning, revision, tool use, verification, and recovery are lifecycle behaviors. If those behaviors are bounded by what the base model can express, then the place to shape them is earlier than we usually do.
Limitations and Closing
This is a small-scale adaptation of Tan et al., not a full reproduction. The experiment uses Qwen3-0.6B-Base, a short RLMT run, and a smaller evaluation budget than the original paper. The right comparison is whether the same training signals work at small scale, not whether the benchmark numbers match Tan et al.
The main limitation is scale. The continued-pretraining result is a judged continuation-quality result, the RLMT reward gate measures the paper-aligned suffix-prediction objective, and the downstream reasoning evaluation is mixed across benchmarks and metrics. These results support the claim that the lifecycle can be made trainable at small scale. They do not support a claim that this model learned a mature agentic reasoning policy.
The thought-use probe should be read the same way. Swapped thoughts reduced reward, teacher thoughts often helped, and oracle@16 exposed useful sampled thoughts. The interface existed, but the learned policy was still immature, and did not reliably generate or select the thoughts that made its own continuation dynamics work.
Ordinary pretraining text can be turned into a sequence of training environments before agentic post-training begins. Continued pretraining can select better continuations, interleaved SFT can expose a thought interface, and RLMT can make that interface rewardable. For small language models, that suggests a path beyond narrow vertical specialization. Train the behavioral interface earlier, then let later post-training refine it.
The experiment is small, but if later post-training is bounded by the behaviors a base model can express, then some of the most important work belongs earlier in the lifecycle.