**Jinjie Ni and the team**
Released on Aug 09 2025
Recent research highlights the potential of diffusion language models (DLMs). Owing to their parallel decoding design, they can generate thousands of tokens per second, resulting in exceptionally low latency for real-world applications [17][18][19]. Moreover, several recent DLMs have demonstrated performance on par with autoregressive (AR) models [8][9].
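To make the parallel-decoding point concrete, here is a minimal, illustrative sketch of confidence-based iterative unmasking, one common decoding style for masked DLMs. The `model`, `mask_id`, and linear unmasking schedule are placeholder assumptions, not the sampler of any specific system cited above.

```python
import torch

# Toy sketch of confidence-based parallel unmasking for a masked DLM.
# `model` is assumed to map a token sequence to per-position logits;
# `mask_id` is an assumed reserved mask token.
@torch.no_grad()
def parallel_decode(model, prompt_ids, gen_len=128, steps=8, mask_id=0):
    device = prompt_ids.device
    # Start with the entire generation region masked.
    seq = torch.cat([prompt_ids,
                     torch.full((gen_len,), mask_id, device=device)])
    for step in range(steps):
        logits = model(seq.unsqueeze(0)).squeeze(0)   # [len, vocab]
        conf, pred = logits.softmax(-1).max(-1)       # per-position confidence
        still_masked = seq == mask_id
        # Commit the most confident fraction of remaining positions each step,
        # so many tokens are decoded per forward pass (parallel decoding).
        n_unmask = max(1, int(still_masked.sum() / (steps - step)))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.topk(n_unmask).indices
        seq[idx] = pred[idx]
    return seq[len(prompt_ids):]
```

With a handful of such refinement steps, a full block of tokens is produced in far fewer forward passes than token-by-token AR decoding, which is where the latency advantage comes from.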
But is speed their only advantage? After rigorous investigations over the past few months, we discovered a more striking trait: diffusion models are super data learners under fixed data budgets. That is, given the same number of unique pre-training tokens, diffusion models consistently outperform AR counterparts of equal size by trading additional FLOPs for improved learning. This amounts to more than 3x the data potential of AR models.
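The "FLOPs for learning" trade can be seen in how the two objectives reuse data. Below is a minimal sketch, assuming a generic bidirectional `model(tokens) -> logits` and a reserved `mask_id`; the uniform masking ratio and unweighted loss are simplifying assumptions, and real DLM recipes differ in schedule and loss weighting.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, tokens, mask_id):
    """Sketch of a masked-diffusion training step (simplified)."""
    b, n = tokens.shape
    # A fresh masking ratio and pattern are drawn every step, so the same
    # unique tokens pose a different prediction problem on every epoch;
    # the extra forward/backward FLOPs buy additional learning signal.
    t = torch.rand(b, 1, device=tokens.device)             # per-sequence ratio
    is_masked = torch.rand(b, n, device=tokens.device) < t
    corrupted = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted)                               # [b, n, vocab]
    return F.cross_entropy(logits[is_masked], tokens[is_masked])

def ar_loss(model, tokens):
    """AR baseline: the same left-to-right next-token objective every epoch."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```

Under repetition, the AR objective revisits an identical factorization of each sequence, while the diffusion objective keeps presenting new corruption patterns of the same data, which is one intuition for why repetition pays off longer for DLMs.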
Such data potential is increasingly valuable as we approach the limits of available pre-training data [20], especially given that AR models show diminishing returns after just four epochs of data reuse [11]. Coincidentally, a concurrent study [1] explores similar topics. However, our careful analysis reveals several methodological issues in [1] that may lead to flawed conclusions.
In this post, we present preliminary results providing strong evidence for a clear “crossover” point where diffusion models outperform AR models. We then delve into the learning behavior of diffusion models to shed light on how this advantage emerges. Finally, we offer a detailed critique of the problematic methodologies in [1], aiming to guide more robust future research.
**Section Highlights**
Figure A: Performance comparison of autoregressive (AR) and masked diffusion (Diffusion) models when repeating a limited portion of data. All models are trained on 96B total tokens (including repetition), with unique tokens varying from 0.5B to 96B. Diffusion models exploit limited unique data better through more repetition. Runs with more unique tokens require more repetition before the crossover appears, and the highest-unique-token runs postpone the crossover beyond our 96B-token observation window.
Overall Setup: Dense 1B/8B models trained on a fixed 96B-token budget, varying unique tokens from 0.5B to 96B. A 1B DLM was also trained for 480 epochs on 1B unique tokens.
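As a back-of-the-envelope aid, the repetition (epoch) counts implied by this fixed budget follow directly from total tokens divided by unique tokens. The snippet below only restates the unique-token levels explicitly mentioned in this post, not every run shown in Figure A.

```python
# Epochs implied by the fixed 96B-token budget: epochs = total / unique.
TOTAL_TOKENS = 96e9
for unique in [0.5e9, 1e9, 1.5e9, 96e9]:   # levels mentioned in the text
    print(f"{unique/1e9:>4.1f}B unique -> {TOTAL_TOKENS/unique:>5.0f} epochs")

# The extended run trains a 1B DLM for 480 epochs on 1B unique tokens,
# i.e. 480B total tokens seen.
print(f"extended run: {480 * 1e9 / 1e9:.0f}B total tokens")
```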
Figure A presents an extensive set of results, providing compelling evidence that, by repeatedly training on ordinary web data, masked DLMs outperform AR counterparts across model sizes in data-constrained settings, demonstrating significantly greater data potential without encountering performance saturation.
Overall, our results suggest that DLMs have more than three times the ultimate data potential of autoregressive models. This estimate is empirically supported within our experiments: DLMs trained on only 0.5B unique tokens (not yet converged) match the performance of AR models trained on 1.5B unique tokens (converged). Increasing the model size from 1B to 8B further unleashes this data potential, whereas AR models do not benefit from the larger size under data constraints. DLMs also show negligible performance degradation when the unique data is drastically reduced from 96B to 0.5B tokens.
Under compute-bound scenarios, where data supply is abundant, AR models fit the training data better and thus achieve superior end-of-training performance. Under data-bound conditions, however, which reflect the current reality of compute growing far faster than data availability, diffusion models eventually surpass AR models by a significant margin. A deeper analysis of this phenomenon is presented in Section 2.