**Jinjie Ni and the team**
Released on Aug 09 2025
Recent research highlights the potential of diffusion language models (DLMs). Owing to their parallel decoding design, they can generate thousands of tokens per second, resulting in exceptionally low latency for real-world applications [17][18][19]. Moreover, several recent DLMs have demonstrated performance on par with autoregressive (AR) models [8][9].
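To make the parallel-decoding point concrete, here is a minimal, illustrative sketch of confidence-based iterative unmasking, one common decoding style for masked DLMs. The `model`, `mask_id`, and linear unmasking schedule are placeholder assumptions, not the sampler of any specific system cited above.

```python
import torch

# Toy sketch of confidence-based parallel unmasking for a masked DLM.
# `model` is assumed to map a token sequence to per-position logits;
# `mask_id` is an assumed reserved mask token.
@torch.no_grad()
def parallel_decode(model, prompt_ids, gen_len=128, steps=8, mask_id=0):
    device = prompt_ids.device
    # Start with the entire generation region masked.
    seq = torch.cat([prompt_ids,
                     torch.full((gen_len,), mask_id, device=device)])
    for step in range(steps):
        logits = model(seq.unsqueeze(0)).squeeze(0)   # [len, vocab]
        conf, pred = logits.softmax(-1).max(-1)       # per-position confidence
        still_masked = seq == mask_id
        # Commit the most confident fraction of remaining positions each step,
        # so many tokens are decoded per forward pass (parallel decoding).
        n_unmask = max(1, int(still_masked.sum() / (steps - step)))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.topk(n_unmask).indices
        seq[idx] = pred[idx]
    return seq[len(prompt_ids):]
```

With a handful of such refinement steps, a full block of tokens is produced in far fewer forward passes than token-by-token AR decoding, which is where the latency advantage comes from.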
But is speed their only advantage? After rigorous investigations over the past few months, we discovered a more striking trait: diffusion models are super data learners under fixed data budgets. That is, given the same number of unique pre-training tokens, diffusion models consistently outperform AR counterparts of equal size by trading additional FLOPs for improved learning. This amounts to more than 3x the data potential of AR models.
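The "FLOPs for learning" trade can be seen in how the two objectives reuse data. Below is a minimal sketch, assuming a generic bidirectional `model(tokens) -> logits` and a reserved `mask_id`; the uniform masking ratio and unweighted loss are simplifying assumptions, and real DLM recipes differ in schedule and loss weighting.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, tokens, mask_id):
    """Sketch of a masked-diffusion training step (simplified)."""
    b, n = tokens.shape
    # A fresh masking ratio and pattern are drawn every step, so the same
    # unique tokens pose a different prediction problem on every epoch;
    # the extra forward/backward FLOPs buy additional learning signal.
    t = torch.rand(b, 1, device=tokens.device)             # per-sequence ratio
    is_masked = torch.rand(b, n, device=tokens.device) < t
    corrupted = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted)                               # [b, n, vocab]
    return F.cross_entropy(logits[is_masked], tokens[is_masked])

def ar_loss(model, tokens):
    """AR baseline: the same left-to-right next-token objective every epoch."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```

Under repetition, the AR objective revisits an identical factorization of each sequence, while the diffusion objective keeps presenting new corruption patterns of the same data, which is one intuition for why repetition pays off longer for DLMs.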
Such data potential is increasingly valuable as we approach the limits of available pre-training data [20], especially given that AR models show diminishing returns after just four epochs of data reuse [11]. Coincidentally, a concurrent study [1] explores similar topics. However, our careful analysis reveals several methodological issues in [1] that may lead to flawed conclusions.
In this post, we present preliminary results providing strong evidence for a clear “crossover” point where diffusion models outperform AR models. We then delve into the learning behavior of diffusion models to shed light on how this advantage emerges. Finally, we offer a detailed critique of the problematic methodologies in [1], aiming to guide more robust future research.
**Section Highlights**
Figure A: Performance comparison of autoregressive (AR) and masked diffusion (Diffusion) models when repeating a limited portion of data. All models are trained on 96B total tokens (including repetition), with unique tokens varying from 0.5B to 96B. Diffusion models exploit limited unique data better through more repetition. Runs with more unique tokens require more repetition before the crossover appears, and the highest-unique-token runs postpone the crossover beyond our 96B-token observation window.
Overall Setup: Dense 1B/8B models trained on a fixed 96B-token budget, varying unique tokens from 0.5B to 96B. A 1B DLM was also trained for 480 epochs on 1B unique tokens.
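As a back-of-the-envelope aid, the repetition (epoch) counts implied by this fixed budget follow directly from total tokens divided by unique tokens. The snippet below only restates the unique-token levels explicitly mentioned in this post, not every run shown in Figure A.

```python
# Epochs implied by the fixed 96B-token budget: epochs = total / unique.
TOTAL_TOKENS = 96e9
for unique in [0.5e9, 1e9, 1.5e9, 96e9]:   # levels mentioned in the text
    print(f"{unique/1e9:>4.1f}B unique -> {TOTAL_TOKENS/unique:>5.0f} epochs")

# The extended run trains a 1B DLM for 480 epochs on 1B unique tokens,
# i.e. 480B total tokens seen.
print(f"extended run: {480 * 1e9 / 1e9:.0f}B total tokens")
```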
Figure A presents an extensive set of results, providing compelling evidence that, by repeatedly training on ordinary web data, masked DLMs outperform AR counterparts across model sizes in data-constrained settings, demonstrating significantly greater data potential without encountering performance saturation.
Overall, our results suggest that DLMs have more than three times the ultimate data potential of autoregressive models. This estimate is empirically supported within our experiments: DLMs trained on only 0.5B unique tokens (not yet converged) match the performance of AR models trained on 1.5B unique tokens (converged). Increasing the model size from 1B to 8B further unleashes this data potential, whereas AR models do not benefit from the larger size under data constraints. DLMs also show negligible performance degradation when the unique data is drastically reduced from 96B to 0.5B tokens.
Under compute-bound scenarios, where data supply is abundant, AR models fit the training data better and thus achieve superior end-of-training performance. Under data-bound conditions, however, which reflect the current reality of compute growing far faster than data availability, diffusion models eventually surpass AR models by a significant margin. A deeper analysis of this phenomenon is presented in Section 2.