Jinjie Ni and team

[email protected]

Released on Sep 30 2025

GitHub: https://github.com/JinjieNi/OpenMoE2

Diffusion language models (DLMs) offer several key advantages over autoregressive (AR) models. Their parallel decoding scheme enables certain commercial DLMs to generate thousands of tokens per second, achieving substantially lower latency in practical applications than AR models [20][21][22]. These models also demonstrate strong performance on coding and mathematical reasoning tasks, likely attributable to their higher compute-to-parameter ratio [20][21][22]. Notably, [2] shows that DLMs are more data-efficient than AR models in low-data regimes (the so-called "intelligence crossover"), highlighting a fundamental advantage of DLMs under data constraints.

Considering these advantages, we aim to establish a principled scaling path for DLMs to further extend their capability frontier. In this work, we present OpenMoE 2, the first systematic study of pre-training sparse diffusion language models (SDLMs) from scratch for parameter scaling. We demonstrate that DLMs can be scaled effectively through either FLOPs-free parameter expansion ("MoE-fying") or parameter-free FLOPs scaling (diffusion-step scaling), offering a way to flexibly adjust the compute-to-parameter ratio when scaling up.
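To make these two knobs concrete, the back-of-envelope sketch below (our own illustration, not from the report) approximates per-token generation FLOPs as 2 × activated parameters × number of denoising passes, ignoring attention terms; the specific numbers are placeholders.

```python
# Illustrative back-of-envelope only: FLOPs/token ~= 2 * activated_params * passes,
# ignoring attention. The numbers below are placeholders, not measured values.
def flops_per_token(activated_params, passes=1):
    return 2 * activated_params * passes

dense_dlm  = flops_per_token(1.7e9)            # dense DLM baseline, one denoising pass
moefied    = flops_per_token(1.7e9)            # "MoE-fying": 8B total params, 1.7B activated
                                               # -> more parameters at the same FLOPs
more_steps = flops_per_token(1.7e9, passes=4)  # diffusion-step scaling: same parameters,
                                               # 4x the FLOPs per generated token
```

The two knobs are independent, which is what allows a sparse DLM to move its compute-to-parameter ratio in either direction.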

Moreover, diffusion and MoE form a double-win combination on repeated data: the diffusion MoE consistently beats the FLOPs-matching dense DLM and all AR models, while the AR MoE performs the worst among all AR counterparts. A strictly controlled experiment yields a striking conclusion: models with a low compute-to-parameter ratio, such as AR MoEs, perform exceptionally poorly on repeated data, worse than both their FLOPs-matching and parameter-matching dense counterparts. In contrast, MoEs trained with the diffusion objective avoid this issue, benefiting from a high and tunable compute-to-parameter ratio.

Importantly, diffusion MoEs inherently support expert-choice routing, enabling higher throughput, adaptive computation, and perfect load balancing without auxiliary objectives.
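To make the routing rule concrete, here is a minimal PyTorch-style sketch of expert-choice selection (names and shapes are ours, not the OpenMoE 2 code): each expert picks its own top-`capacity` tokens, so every expert is perfectly load-balanced by construction, while the number of experts that pick a given token can vary, which is the adaptive-computation aspect. Expert choice is natural here likely because diffusion decoding is non-causal, sidestepping the future-token leakage that complicates expert choice in AR models.

```python
import torch
import torch.nn.functional as F

def expert_choice_route(hidden, router_weight, capacity):
    """Expert-choice routing sketch: each expert selects its top-`capacity`
    tokens from the flattened batch, giving perfect load balance by
    construction and letting "hard" tokens be picked by several experts.

    hidden:        [num_tokens, d_model]   flattened token states
    router_weight: [d_model, num_experts]  router projection
    capacity:      tokens processed per expert
    """
    logits = hidden @ router_weight                 # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)               # post-softmax router
    # Each expert ranks all tokens and keeps its top-`capacity`.
    scores, token_idx = probs.t().topk(capacity, dim=-1)   # both [num_experts, capacity]
    return scores, token_idx
```

Because every expert fills exactly `capacity` slots, per-expert compute is fixed ahead of time, which is where the throughput benefit comes from.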

We further conduct extensive ablations on architectural design choices, yielding insights that inform our final scaling strategy (TBD). Below, we summarize the key experiments and their corresponding conclusions.

Summary of Architectural Ablations

| Design Choice | Description | Section | Conclusion |
| --- | --- | --- | --- |
| MoE vs. dense | Expert-choice (EC) and token-choice (TC) diffusion MoEs compared with their FLOPs-matching and parameter-matching dense models. | 1 | Diffusion MoEs fall between their FLOPs- and parameter-matching dense counterparts, leaning toward the parameter-matching one on knowledge-intensive tasks. |
| Diffusion MoE vs. AR MoE | Diffusion and AR MoEs vs. their FLOPs-matching and parameter-matching dense models in multi-epoch training. | 2 | The worst DLM beats the best AR model; the diffusion MoE consistently beats its FLOPs-matching dense counterpart, while the AR MoE performs the worst among all AR counterparts. |
| Token-choice vs. expert-choice | Token-choice routing vs. expert-choice routing. | 3 | No noticeable performance difference, while the expert-choice router enables higher throughput, adaptive computation, and perfect load balancing without auxiliary objectives. |
| Token-wise load balancing | Adding a load-balancing loss along the token dimension, designed specifically for expert-choice MoEs. | 4 | The load-balancing loss consistently performs worse, highlighting the importance of adaptive computation. |
| Shared experts | With vs. without shared experts. | 5 | Shared experts serve as a backstop for token dropping in expert-choice MoEs, consistently raising performance (see the sketch after this table). |
| Expert granularity | Scaling up the number of experts while keeping the total parameter count constant. | 6 | Performance first increases, then decreases as expert granularity is scaled up. |
| Upcycling | Training from scratch vs. upcycling from a dense model. | 7 | The upcycled variant starts from a higher point but is surpassed by the from-scratch MoE in the middle of training. |
| MoE layer skipping | Using vanilla FFN layers for the first few transformer layers for routing stability. | 8 | Skipping the first two layers yields no noticeable gains. |
| Scaling factors | Adding scaling factors to token-choice and expert-choice routing to balance the output norms of the shared and routed experts. | 9 | No noticeable gain for token-choice MoEs; lower training and validation loss for expert-choice MoEs. |
| Batch-level EC vs. sequence-level EC | Performing expert choice at the batch level vs. the sequence level. | 10 | Batch-level consistently outperforms sequence-level, at the cost of sequence independence. |
| Softmax vs. sigmoid | The router activation function. | 11 | No noticeable difference; softmax performs slightly better on MMLU. |
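To illustrate the shared-expert backstop referenced in the table, below is a simplified, hypothetical layer sketch (the class name, plain SiLU MLP experts in place of SwiGLU, and all shapes are our own simplifications, not the OpenMoE 2 implementation): shared experts process every token, so tokens that no routed expert selects still receive an FFN update.

```python
import torch
import torch.nn as nn

class ExpertChoiceMoELayer(nn.Module):
    """Simplified sketch: shared experts see every token, so tokens dropped
    by expert-choice routing still get transformed by an FFN."""

    def __init__(self, d_model, d_ff, n_routed, n_shared, capacity):
        super().__init__()
        def make_ffn():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.routed = nn.ModuleList([make_ffn() for _ in range(n_routed)])
        self.shared = nn.ModuleList([make_ffn() for _ in range(n_shared)])
        self.capacity = capacity

    def forward(self, x):                               # x: [num_tokens, d_model]
        out = sum(expert(x) for expert in self.shared)  # shared experts: all tokens
        probs = self.router(x).softmax(dim=-1)          # post-softmax router
        scores, idx = probs.t().topk(self.capacity, dim=-1)  # expert choice
        for e, expert in enumerate(self.routed):
            tok = idx[e]                                # tokens chosen by expert e
            out = out.index_add(0, tok, scores[e].unsqueeze(-1) * expert(x[tok]))
        return out
```

In expert-choice routing, some tokens may be selected by no routed expert; the shared experts guarantee those tokens are still processed, which matches the "backstop" conclusion above.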

Default Experimental Settings

| Setting | Value |
| --- | --- |
| Codebase | Megatron-LM |
| Data | Nemotron-CC |
| Model size | 8B total parameters, 1.7B activated |
| Model arch | General: c100k tokenizer, RoPE, SwiGLU, pre-layer RMSNorm, bias-free, QK normalization. MoE: expert-choice routing, 66 experts with 2 shared, 0.001 z-loss (see the sketch below), post-softmax router, batch-level EC, no scaling factors, no token-wise load-balancing loss, trained from scratch, all layers use MoE. |
| Hyperparameters | 4096 sequence length, 1024 batch size, cosine schedule with 4e-4 peak learning rate, AdamW with 0.1 weight decay |
| Eval | c4-en validation; 0-shot HellaSwag; 0-5 shot MMLU (val) |
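For reference, we take the z-loss in the settings above to be the standard router z-loss (as in ST-MoE), which penalizes large router logits for numerical stability; a minimal sketch with the 0.001 coefficient:

```python
import torch

def router_z_loss(router_logits, coeff: float = 1e-3):
    """Router z-loss sketch: squared log-sum-exp of the router logits,
    averaged over tokens and scaled by a small coefficient (0.001 here)."""
    z = torch.logsumexp(router_logits, dim=-1)   # [num_tokens]
    return coeff * (z ** 2).mean()
```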

1. Parameter–Compute Trade-off

Figure 1: Benchmarking FLOPs-matching and parameter-matching dense DLMs against expert-choice and token-choice diffusion MoEs. All training settings are identical except for model size and architecture.

MoE enables scaling to highly parameterized networks while activating only a subset of parameters during training and inference, improving efficiency [1]. Therefore, when comparing dense models with their sparse counterparts, there are two natural framings: (1) in the FLOPs-matching setting, how much does the model benefit from parameter expansion alone? (2) in the parameter-matching setting, how much performance does it cost to cut the FLOPs spent per token?
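As a rough illustration with the default configuration above (our arithmetic, using the common approximation of ~2 × activated parameters FLOPs per token and ignoring attention terms):

```python
# Illustrative arithmetic for the two dense baselines of an 8B-total /
# 1.7B-activated MoE, approximating FLOPs/token as 2 * activated params.
total_params, activated_params = 8.0e9, 1.7e9

moe_flops_per_token        = 2 * activated_params  # the MoE itself
flops_matched_dense_params = activated_params      # ~1.7B dense: same FLOPs, ~4.7x fewer params
param_matched_dense_flops  = 2 * total_params      # ~8B dense: same params, ~4.7x the FLOPs

print(f"param-matched dense uses "
      f"{param_matched_dense_flops / moe_flops_per_token:.1f}x the FLOPs/token")
```

The FLOPs-matching baseline isolates the value of the extra parameters, while the parameter-matching baseline isolates the cost of sparsifying the compute.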