
💼 National University of Singapore 📮 [email protected]


Table of Contents

Aug 2024 | Don’t Build Random Evals: Principles for General-Purpose Model Evaluation

Jinjie Ni | National University of Singapore

Model evaluation is both simple and complicated. It is simple to build an arbitrary evaluation pipeline that reflects some arbitrary subset of a model's abilities; building a "correct" evaluation, however, requires careful consideration. Great posts by Jason Wei and Clémentine Fourrier have already shared essential points on how to conduct LLM evaluation.

In this blog post, we discuss how to build a "correct" model evaluation that stays useful in the long term, and lay out the basic principles.

What is model evaluation? Model evaluation is an efficient proxy that measures how well a model will perform in real-world use cases before it is actually deployed. This understanding is central to the rest of this post. Based on it, a "correct" model evaluation that stays useful in the long term should follow two principles: Generalizable and Efficient.

Aug 2025 | Diffusion Language Models are Super Data Learners

Jinjie Ni | National University of Singapore

We pre-trained DLMs and AR models from scratch at up to 8B parameters and 480B tokens. DLMs demonstrate more than 3× the data potential of autoregressive (AR) models. Notably, a 1B-parameter masked diffusion model achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, simply by repeating standard pre-training data. More repetitions could improve its performance further, as no signs of diminishing returns were observed.
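For readers unfamiliar with how masked diffusion pre-training differs from AR pre-training, here is a minimal sketch of one masked-diffusion training step. It is illustrative only, assuming a standard masked-diffusion formulation (random corruption level, cross-entropy on masked positions with 1/t weighting); `model`, `mask_id`, and the exact loss weighting are assumptions, not the paper's recipe.

```python
# Minimal masked-diffusion training step (illustrative sketch, not the exact
# recipe from the post). `model` maps token ids -> logits over the vocabulary.
import torch
import torch.nn.functional as F

def masked_diffusion_step(model, tokens, mask_id):
    """tokens: LongTensor [batch, seq_len] of clean token ids."""
    b, L = tokens.shape
    # Sample a corruption level t in (0, 1] per sequence.
    t = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)
    # Independently replace each position with [MASK] with probability t.
    is_masked = torch.rand(b, L, device=tokens.device) < t
    noisy = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
    # A bidirectional transformer predicts the original ids at every position.
    logits = model(noisy)                                   # [b, L, vocab]
    loss_tok = F.cross_entropy(
        logits.transpose(1, 2), tokens, reduction="none")   # [b, L]
    # Only masked positions contribute; the 1/t weighting gives the standard
    # masked-diffusion (ELBO-style) objective.
    loss = ((loss_tok * is_masked) / t).sum() / is_masked.sum().clamp(min=1)
    return loss
```

Unlike the AR next-token loss, every position can attend to both its left and right context, which is what enables the bidirectional modeling discussed below.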

DLMs are super-dense models that consume more FLOPs than dense AR models. Training DLMs to fully leverage the data typically demands at least two orders of magnitude more FLOPs. During inference, generating sequences of 16 to 4096 tokens incurs a 16× to 4700× increase in FLOPs over AR baselines. In addition, the more expressive bidirectional attention enabled by the diffusion objective models language data, which is not fully causal, in both directions, squeezing more value out of it.
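A rough back-of-envelope shows where a sequence-length-scale FLOPs gap comes from. The sketch below assumes AR decoding with a KV cache costs about 2N FLOPs per generated token, while diffusion decoding runs one full forward pass over all L positions per denoising step, with roughly L steps; these are simplifying assumptions, and the measured 16× to 4700× range also depends on attention cost and the actual number of denoising steps.

```python
# Back-of-envelope FLOPs comparison for generating L tokens with an N-parameter
# model (illustrative assumptions, not the exact accounting in the post).
def ar_decode_flops(n_params, seq_len):
    # With a KV cache, each AR step processes ~1 new token: ~2*N FLOPs/token.
    return 2 * n_params * seq_len

def diffusion_decode_flops(n_params, seq_len, num_steps=None):
    # Assume one full bidirectional forward pass over all L positions per
    # denoising step, with num_steps ~= seq_len by default.
    steps = num_steps or seq_len
    return 2 * n_params * seq_len * steps

for L in (16, 256, 4096):
    ratio = diffusion_decode_flops(1e9, L) / ar_decode_flops(1e9, L)
    print(f"L={L:5d}: diffusion / AR ~ {ratio:.0f}x")
```

Under these assumptions the ratio grows linearly with sequence length, which is the same ballpark as the reported range.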

The concurrent work "Diffusion Beats Autoregressive in Data-Constrained Settings" contains methodological issues, including a problematic diffusion loss formulation, invalid comparison metrics, unfair settings for AR models, and a problematic scaling-law formulation, all of which may lead to misleading results and conclusions.

Oct 2025 | OpenMoE 2: Sparse Diffusion Language Models

Jinjie Ni | National University of Singapore

Diffusion language models (DLMs) offer several key advantages over autoregressive (AR) models. Their parallel decoding scheme enables certain commercial DLMs to generate thousands of tokens per second, achieving substantially lower latency in practical applications than AR models [20][21][22]. These models [20][21][22] also demonstrate strong performance on coding and mathematical reasoning tasks, likely attributable to their higher compute-to-parameter ratio. Notably, [2] shows that DLMs possess greater data efficiency than AR models in low-data regimes (so-called "intelligence crossovers"), highlighting a fundamental advantage of DLMs under data constraints.

Considering these advantages, we aim to establish a principled scaling path for DLMs to further extend their capability frontier. In this work, we present OpenMoE 2, the first systematic study of pre-training sparse diffusion language models (SDLMs) from scratch for parameter scaling. We demonstrate that DLMs can be scaled effectively through either FLOPs-free parameter expansion ("MoE-fying") or parameter-free FLOPs scaling (scaling the number of diffusion steps), offering a way to flexibly adjust the compute-to-parameter ratio when scaling up, as sketched below.
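To make the two scaling knobs concrete, here is a small accounting sketch of the compute-to-parameter ratio. The numbers (1B active parameters, 8× expert expansion, ~2 FLOPs per active parameter per forward pass) are assumptions chosen for illustration, not OpenMoE 2's actual configuration.

```python
# Illustrative compute-to-parameter accounting for the two scaling knobs:
# MoE-fying grows total parameters at fixed per-step FLOPs, while scaling
# diffusion steps grows FLOPs at fixed parameters.
def flops_per_token(active_params, diffusion_steps):
    # ~2 FLOPs per active parameter per forward pass, once per denoising step.
    return 2 * active_params * diffusion_steps

def compute_to_param_ratio(total_params, active_params, diffusion_steps):
    return flops_per_token(active_params, diffusion_steps) / total_params

dense = dict(total_params=1e9, active_params=1e9)
moe   = dict(total_params=8e9, active_params=1e9)  # "MoE-fying": 8x params, same active FLOPs

for steps in (1, 64):
    print(f"steps={steps:3d} | dense: {compute_to_param_ratio(**dense, diffusion_steps=steps):6.2f}"
          f" | moe: {compute_to_param_ratio(**moe, diffusion_steps=steps):6.2f}")
```

MoE-fying lowers the ratio; adding diffusion steps raises it again, which is why the two knobs together let the ratio be tuned freely.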

Moreover, diffusion and MoEs form a double-win combination on repeated data: the diffusion MoE consistently beats the FLOPs-matched dense DLM and all AR models, while the AR MoE performs the worst among all AR counterparts. A strictly controlled experiment gives a striking conclusion: models with a low compute-to-parameter ratio, such as AR MoEs, perform exceptionally poorly on repeated data, worse than both their FLOPs-matched and parameter-matched dense counterparts. In contrast, MoEs trained with the diffusion objective avoid this issue, benefiting from a high and tunable compute-to-parameter ratio.

Importantly, diffusion MoE inherently supports expert-choice routing, enabling higher throughput, adaptive computation, and perfect load balancing without auxiliary objectives.
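For context, expert-choice routing inverts the usual token-choice scheme: each expert selects its top-C tokens instead of each token selecting its top-k experts, so every expert processes exactly the same number of tokens and load balance holds by construction, with no auxiliary balancing loss. Below is a minimal sketch of that selection step; it is an illustration of the general technique, not the OpenMoE 2 implementation.

```python
# Minimal expert-choice routing sketch: each expert picks its top-`capacity`
# tokens, giving perfectly balanced expert loads without an auxiliary loss.
import torch

def expert_choice_route(router_logits, capacity):
    """router_logits: [num_tokens, num_experts]; capacity: tokens per expert."""
    scores = router_logits.softmax(dim=-1)                # [T, E]
    # Each expert (column) selects its top-`capacity` tokens by routing score.
    gate, token_idx = scores.t().topk(capacity, dim=-1)   # both [E, C]
    return gate, token_idx                                # expert e processes tokens token_idx[e]

# Usage: every expert gets exactly `capacity` tokens; some tokens may be
# chosen by several experts, others by none (that is the adaptive-compute part).
logits = torch.randn(16, 4)            # 16 tokens, 4 experts
gate, idx = expert_choice_route(logits, capacity=4)
```

This works naturally with bidirectional diffusion decoding, where all token positions are visible at routing time; causal AR decoding makes expert-choice selection over future tokens awkward.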