Home | Twitter / X | Google Scholar | Github

💼 National University of Singapore 📮 [email protected]


Table of Contents

Aug 2024 | Don’t Build Random Evals: Principles for General-Purpose Model Evaluation

Jinjie Ni | National University of Singapore

Model evaluation is both simple and complicated. It is simple to build an arbitrary evaluation pipeline that reflects arbitrary abilities of a model; building a “correct” evaluation, however, requires careful consideration. The great posts by Jason Wei and Clémentine Fourrier share some essential points on how to conduct LLM evaluation.

In this blog post, we discuss how to build a “correct” model evaluation that stays useful in the long term, and lay out the basic principles.

What is model evaluation? Model evaluation is an efficient proxy for how well a model will perform in real-world use cases before it is actually deployed. This understanding is very important and is the core of the rest of this post. Based on it, a “correct” model evaluation that stays useful in the long term should follow two principles: it must be Generalizable and Efficient.
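To make the proxy framing concrete, here is a minimal sketch (the function names and the toy benchmark are hypothetical, not taken from the post): an evaluation is just a fixed set of inputs plus a scoring rule, standing in for real-world usage.

```python
# Minimal sketch of an evaluation-as-proxy loop (hypothetical API, for illustration only).
from typing import Callable, List, Tuple


def evaluate(model: Callable[[str], str],
             benchmark: List[Tuple[str, str]]) -> float:
    """Score a model on a fixed benchmark of (prompt, reference) pairs.

    The benchmark is the proxy: the better its prompts generalize to real
    use cases and the cheaper it is to run, the more useful the number it returns.
    """
    correct = 0
    for prompt, reference in benchmark:
        prediction = model(prompt)
        correct += int(prediction.strip() == reference.strip())
    return correct / len(benchmark)


if __name__ == "__main__":
    # Trivial stand-in "model" and benchmark, purely for demonstration.
    benchmark = [("2 + 2 =", "4"), ("Capital of France?", "Paris")]
    dummy_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
    print(f"accuracy = {evaluate(dummy_model, benchmark):.2f}")
```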

Aug 2025 | Diffusion Language Models are Super Data Learners

Jinjie Ni | National University of Singapore

We pre-trained DLMs and AR models from scratch at up to 8B parameters and 480B tokens. DLMs demonstrate > 3x greater data potential than autoregressive (AR) models. Notably, a 1B-parameter masked diffusion model achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, simply by repeating standard pre-training data. More repetitions could improve its performance further, as no signs of diminishing returns were observed.
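For intuition only, the masked diffusion objective referenced above can be sketched roughly as follows. This is a simplified, hypothetical PyTorch-style training step, not the actual code behind these results: a random fraction of tokens is replaced with a mask token, and the model, attending bidirectionally, is trained to recover them.

```python
# Rough sketch of one masked-diffusion LM training step (simplified; the model,
# mask_id, and optimizer objects are assumed, not from the original work).
import torch
import torch.nn.functional as F


def masked_diffusion_step(model, tokens, mask_id, optimizer):
    """tokens: (batch, seq_len) int tensor of input ids."""
    # Sample a masking ratio t ~ U(0, 1) per sequence, mask that fraction of
    # positions, and train the bidirectional model to predict the originals.
    t = torch.rand(tokens.size(0), 1, device=tokens.device)
    mask = torch.rand(tokens.shape, device=tokens.device) < t
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)

    logits = model(noisy)                       # (batch, seq_len, vocab)
    loss = F.cross_entropy(logits[mask],        # loss only on masked positions
                           tokens[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```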

DLMs are super-dense models: they consume more FLOPs than dense AR models. Training a DLM to fully leverage the data typically demands at least two orders of magnitude more FLOPs. During inference, generating sequences of 16 to 4096 tokens incurs a 16× to 4700× increase in FLOPs over AR baselines. In addition, the diffusion objective enables more expressive bidirectional attention, which allows bidirectional modeling of language data (which is not fully causal) and thereby squeezes more value out of it.
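As a rough, assumed cost model (not the exact accounting behind the numbers above), the inference gap comes from AR decoding paying roughly one incremental forward pass per token with a KV cache, while diffusion decoding re-processes the full sequence at every denoising step:

```python
# Back-of-the-envelope inference FLOPs ratio, diffusion vs. AR decoding
# (assumed simplified cost model, for intuition only).
def ar_flops(n_tokens: int, flops_per_token: float = 1.0) -> float:
    # With a KV cache, each generated token costs roughly one incremental pass.
    return n_tokens * flops_per_token


def diffusion_flops(n_tokens: int, n_steps: int, flops_per_token: float = 1.0) -> float:
    # Each denoising step re-processes the full sequence bidirectionally.
    return n_steps * n_tokens * flops_per_token


# With one denoising step per token position, the ratio grows roughly with
# sequence length, which matches the order of magnitude of the range above.
seq_len, steps = 4096, 4096
print(diffusion_flops(seq_len, steps) / ar_flops(seq_len))  # ~4096x in this toy model
```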

Our concurrent work, “Diffusion Beats Autoregressive in Data-Constrained Settings”, contains methodological issues, including a problematic diffusion loss formulation, invalid comparison metrics, unfair settings for AR models, and a problematic scaling-law formulation, all of which may lead to misleading results and conclusions.