Synthetic Data: How AI Models Are Training on AI-Generated Data

Real data is running out. Synthetic data generation is becoming essential, but it comes with hidden risks.

The AI industry is running out of the one resource its entire scaling strategy depended on, human-written text, and the fix that labs have quietly converged on is training new models on data generated by old ones.

The data wall arrives on schedule

The scale of the problem is no longer speculative. Estimates put the total volume of publicly available, reasonably high-quality internet text at roughly 300 trillion tokens, and at the scales frontier labs currently train at, that entire pool is expected to be effectively exhausted within two to three years. This is the long-predicted data wall, arriving roughly on the timeline researchers warned about years ago, and it forces a choice: slow down scaling, pay enormous sums to license proprietary data troves, or generate new training data synthetically. Every major lab has chosen some mixture of the second and third options, and the third has grown fastest because it is the only one whose cost does not rise in step with the volume of data it delivers.

How synthetic data actually gets made and used

The mechanism is straightforward to describe even though the engineering behind it is not: an existing capable model generates new text, code, or reasoning traces designed to serve as training material for the next model in the lineage. OpenAI, Google, and Anthropic have all confirmed using synthetic data in recent training runs, and Meta has disclosed that Llama 4's training set was roughly 40 percent synthetic. The appeal is obvious once you consider what it solves: synthetic data can be generated in effectively unlimited quantity and targeted precisely at gaps in the real-world data (more worked examples of medical reasoning, more coverage of a low-resource language synthesized from high-resource translations, more step-by-step math solutions) without waiting for the internet to organically produce more of that content.
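The gap-targeting idea can be sketched in a few lines. This is only an illustration under stated assumptions: `generate_for_gaps`, `SyntheticExample`, and `stub_teacher` are hypothetical names, the prompt template is invented, and the stub stands in for whatever existing model a team would actually call. It is not any lab's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SyntheticExample:
    topic: str
    prompt: str
    text: str
    source: str = "synthetic"  # provenance tag so later stages can tell it from real data


def generate_for_gaps(
    real_counts: Dict[str, int],
    target_per_topic: int,
    generate: Callable[[str], str],
) -> List[SyntheticExample]:
    """Generate examples only for topics where real data falls short of the target."""
    examples: List[SyntheticExample] = []
    for topic, have in real_counts.items():
        needed = max(0, target_per_topic - have)
        for i in range(needed):
            # Hypothetical prompt template; a real pipeline would vary prompts far more.
            prompt = f"Write a worked example about {topic} (variant {i + 1})."
            examples.append(SyntheticExample(topic, prompt, generate(prompt)))
    return examples


def stub_teacher(prompt: str) -> str:
    """Placeholder for a call to an existing capable model (assumption, not a real API)."""
    return f"[model output for: {prompt}]"


if __name__ == "__main__":
    counts = {"medical reasoning": 2, "step-by-step math": 5}
    for ex in generate_for_gaps(counts, target_per_topic=4, generate=stub_teacher):
        print(ex.topic, "|", ex.prompt)
```

Tagging every record with its provenance at generation time is a deliberate choice here: it keeps the option of holding synthetic examples out of validation sets later.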

Model collapse is the risk that keeps researchers up at night

The failure mode has a name now: model collapse. When a model trains substantially on data produced by an earlier model rather than by humans, small errors, stylistic quirks, and blind spots do not stay constant. They compound across generations. A paper published in Nature in 2024 demonstrated this concretely, showing that models trained across multiple generations on primarily synthetic data progressively lost the ability to represent rare but genuinely important patterns in real-world data, essentially smoothing away the tails of the distribution in favor of whatever was most statistically common in the synthetic corpus. The danger is subtle precisely because a collapsing model does not necessarily get worse at common tasks. It gets worse specifically at the unusual, high-stakes cases that matter most and whose failures are hardest to notice in routine evaluation.
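The tail-loss effect can be reproduced with a toy simulation. The sketch below is not the experiment from the paper. It assumes a "model" that is nothing more than the empirical frequency table of its training sample, and the function name, the Zipf-like starting weights, and the sample sizes are all illustrative choices. Each generation samples from the previous table and refits. Once a rare category fails to appear in a sample, its estimated probability becomes zero and it can never return.

```python
import random
from collections import Counter
from typing import List, Sequence


def simulate_collapse(
    weights: Sequence[float],
    sample_size: int,
    generations: int,
    seed: int = 0,
) -> List[int]:
    """Return the number of categories with nonzero probability after each generation."""
    rng = random.Random(seed)
    categories = list(range(len(weights)))
    current = list(weights)
    surviving: List[int] = []
    for _ in range(generations):
        # "Generate" a training set from the current model.
        sample = rng.choices(categories, weights=current, k=sample_size)
        # "Train" the next model: its distribution is just the observed counts.
        counts = Counter(sample)
        current = [counts.get(c, 0) for c in categories]
        surviving.append(sum(1 for w in current if w > 0))
    return surviving


if __name__ == "__main__":
    # Long-tailed starting distribution: a few common categories, many rare ones.
    zipf_like = [1.0 / (rank + 1) for rank in range(50)]
    history = simulate_collapse(zipf_like, sample_size=200, generations=30)
    print("surviving categories per generation:", history)
```

The surviving count can only stay flat or fall, which is the toy version of the tails being smoothed away. Mixing fresh real samples into each generation is what breaks that one-way ratchet.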

The hybrid curriculum that is emerging as best practice

The approach labs are converging on is neither pure synthetic nor pure organic data. It is a hybrid curriculum that deliberately mixes high-quality real-world data with carefully filtered synthetic data, paired with separate validation sets designed to catch early symptoms of collapse before they compound across a full training run. This requires treating synthetic data generation as its own quality-controlled pipeline, with filtering steps that reject outputs showing the repetitive, mode-collapsed patterns that signal a model has started imitating its own prior outputs rather than genuinely reasoning through new examples. Vincony's fine-tuning pipeline builds this discipline in directly, supporting synthetic data generation alongside the quality-filtering steps needed to keep a fine-tuning dataset from drifting toward collapse.
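One simple form such a filtering step could take is sketched below. It is a minimal heuristic, not Vincony's or any lab's actual filter: it drops near-verbatim duplicates and rejects samples whose share of distinct word trigrams falls below a threshold. The function names and the 0.7 threshold are assumptions chosen for illustration, and a production pipeline would likely add model-based quality scoring on top.

```python
from typing import List


def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are unique; low values indicate looping output."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return 1.0  # too short to judge, so do not penalize
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def filter_synthetic(samples: List[str], min_ratio: float = 0.7, n: int = 3) -> List[str]:
    """Keep samples that are neither duplicates nor internally repetitive."""
    kept: List[str] = []
    seen = set()
    for text in samples:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        if distinct_ngram_ratio(text, n) < min_ratio:
            continue
        seen.add(key)
        kept.append(text)
    return kept


if __name__ == "__main__":
    batch = [
        "the answer is the answer is the answer is the answer is",
        "Add 2 and 3 to get 5, then double it to get 10.",
        "add 2 and 3 to get 5, then  double it to get 10.",
    ]
    print(filter_synthetic(batch))
```

On this batch only the second sample survives: the first is rejected as repetitive and the third is a whitespace-and-case variant of the second.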

What this means for teams fine-tuning smaller models

The stakes are not limited to the handful of labs training frontier-scale foundation models. Any team fine-tuning a smaller model on a mix of real customer data and synthetically augmented examples is running a miniature version of the same experiment. The same collapse risk applies at that scale too, only faster, because a fine-tuning run has far less real data diluting the synthetic portion to begin with. The practical, low-cost habit that catches this early is to hold out a validation set drawn exclusively from real, unaugmented examples and to check, across successive rounds of retraining, that fine-tuned performance on rare edge cases does not quietly degrade. Done consistently, that surfaces the problem well before it shows up as a customer-facing failure.
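That habit can be sketched as a small regression check. Everything named here is an assumption for illustration: `accuracy` and `check_for_degradation` are hypothetical helpers, the 0.02 tolerance is arbitrary, and exact-match accuracy stands in for whatever metric fits the task. The essential constraint from the paragraph above is that the held-out examples are real and unaugmented, and that the edge-case slice is scored separately so it cannot be averaged away.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Example = Tuple[Hashable, Hashable]  # (input, expected output)


def accuracy(predict: Callable, examples: Sequence[Example]) -> float:
    """Exact-match accuracy of a model's predict function on held-out examples."""
    if not examples:
        raise ValueError("holdout set must not be empty")
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)


def check_for_degradation(round_scores: Sequence[float], tolerance: float = 0.02) -> List[int]:
    """Return indices of retraining rounds that fell below the best earlier score."""
    flagged: List[int] = []
    best = round_scores[0]
    for i, score in enumerate(round_scores[1:], start=1):
        if score < best - tolerance:
            flagged.append(i)
        best = max(best, score)
    return flagged


if __name__ == "__main__":
    # Real, unaugmented edge cases only (toy task: parity of an integer).
    rare_holdout = [(1, 1), (2, 0), (3, 1), (4, 0)]
    round_models = [
        lambda x: x % 2,  # round 0: handles the edge cases
        lambda x: 1,      # round 1: has collapsed to the most common answer
    ]
    scores = [accuracy(m, rare_holdout) for m in round_models]
    print("edge-case accuracy per round:", scores)
    print("rounds to investigate:", check_for_degradation(scores))
```

Comparing against the best earlier round rather than only the previous one means a slow, steady decline is still flagged even when no single step exceeds the tolerance.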

A question bigger than any single training run

Underneath the technical debate sits a harder question that the industry has not resolved: if future models increasingly learn from text generated by earlier models rather than from the messy, contradictory, genuinely diverse output of human thought, what happens to the diversity of ideas and expression that made the original training data valuable in the first place? That is not a problem synthetic-data filtering alone can solve, and it is likely to shape how the next several generations of frontier models, and the smaller fine-tuned models built on top of them, actually behave in practice.