Co-Adaptation Under Reinforcement Learning: Measuring How Scaled RL Reshapes Expert Routing in Production Mixture-of-Experts Models
Reinforcement learning (RL) post-training updates all parameters of a Mixture-of-Experts (MoE) model jointly---expert weights, gate weights, and routing biases co-adapt simultaneously---yet its effect on expert routing has never been systematically measured, let alone causally tested. We present the first layer-by-layer, domain-stratified analysis of this co-adaptation in production-scale MoE models, together with a causal intervention that disentangles routing from expert weight changes. MiniMax-M2.1 and M2.5 share identical architecture---229B parameters, 62 transformer layers, 256 experts, top-8 sigmoid gating---and both undergo SFT and RL via CISPO, but M2.5 extends training with ~2x more environments and process reward signals, isolating the marginal effect of scaled RL. By recording per-token routing decisions across all 62 layers over 847,159 tokens spanning six domains, we find that scaled RL produces domain-selective routing changes. Code routing concentrates onto fewer experts (Delta H = -0.23 bits, p < 0.01; robust under subsampling and sample-level bootstrap), while reasoning (+0.11 bits), general (+0.17 bits), and instruction-following (+0.17 bits) routing disperses. To test whether these routing changes cause capability differences, we perform a gate-swap experiment: transplanting M2.5's gate weights into M2.1's body (and vice versa) across all 62 layers and measuring per-domain cross-entropy loss. The results reveal that routing changes are structurally real but functionally secondary: swapping gates produces <1% perplexity change on code, while reverting M2.5 to M2.1's routing improves general-domain perplexity by 7.8%---indicating that the additional RL-induced routing specialization taxes non-target domains. To test generality, we replicate the observational analysis on DeepSeek V3 (671B parameters, 256 experts with grouped top-8 routing) comparing V3-Base (pre-trained) against V3 (SFT+RL). 
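The entropy measurements above can be illustrated with a minimal sketch: Shannon entropy (in bits) of the empirical expert-usage distribution, computed from recorded per-token routing decisions. The data below is synthetic and the function name `routing_entropy_bits` is our own; this only illustrates the metric, not the paper's pipeline.

```python
import numpy as np

def routing_entropy_bits(expert_ids, n_experts=256):
    """Shannon entropy (bits) of the empirical expert-usage distribution.

    expert_ids: flat array of expert indices selected across tokens/slots.
    Lower entropy means routing concentrates onto fewer experts.
    """
    counts = np.bincount(np.asarray(expert_ids).ravel(), minlength=n_experts)
    p = counts / counts.sum()
    p = p[p > 0]  # 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
# Hypothetical traces: "concentrated" usage (32 favored experts) vs near-uniform usage
concentrated = rng.choice(256, size=10_000,
                          p=np.r_[np.full(32, 0.02), np.full(224, 0.36 / 224)])
dispersed = rng.choice(256, size=10_000)
dH = routing_entropy_bits(concentrated) - routing_entropy_bits(dispersed)
print(round(dH, 2))  # negative: concentration lowers entropy
```

A per-domain Delta H is then simply this quantity computed on one model's routing trace minus the same quantity on the other model's trace, restricted to tokens from that domain.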
DeepSeek exhibits substantially higher routing stability (mean Jaccard = 0.67 vs. MiniMax's 0.60; mean top-1 agreement = 0.85 vs. 0.60) and uniformly small entropy changes (|Delta H| < 0.12 bits across all domains), in stark contrast to MiniMax's asymmetric -0.23 to +0.17 bit range. DeepSeek's two-stage grouped routing (top-4 of 8 groups, then top-8 among the 128 experts in the selected groups) acts as an architectural stabilizer: group-level Jaccard remains 0.82 even as expert-level agreement declines, with 73.5% of routing divergence occurring within preserved groups. These cross-architecture results establish that post-training routing changes are universal but architecture-modulated: grouped routing constrains the geometry of co-adaptation, dampening the domain-selective specialization observed under direct routing.
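The stabilizing effect of two-stage grouped routing can be sketched as follows. This is a simplified illustration of the scheme described above (top-4 of 8 groups, then top-8 among the surviving 128 experts), not DeepSeek's actual implementation; the group-affinity rule (max expert score per group) and the function name are assumptions for the sketch.

```python
import numpy as np

def grouped_top8(scores, n_groups=8, k_groups=4, k_experts=8):
    """Two-stage grouped routing over 256 experts (8 groups x 32 experts).

    Stage 1: keep the top-4 groups, scored here by each group's max affinity.
    Stage 2: take the top-8 experts among the 128 experts in those groups.
    Returns the indices of the routed experts.
    """
    group_size = scores.size // n_groups
    groups = scores.reshape(n_groups, group_size)        # (8, 32)
    group_scores = groups.max(axis=1)                    # one affinity per group
    top_groups = np.argsort(group_scores)[-k_groups:]

    masked = np.full_like(scores, -np.inf)               # drop pruned groups
    for g in top_groups:
        lo = g * group_size
        masked[lo:lo + group_size] = scores[lo:lo + group_size]
    return np.argsort(masked)[-k_experts:]

rng = np.random.default_rng(1)
experts = grouped_top8(rng.standard_normal(256))
print(sorted(int(e) for e in experts))
```

Because the stage-1 group choice is a coarser decision than the stage-2 expert choice, small post-training perturbations to gate scores tend to reshuffle experts *within* surviving groups before they flip a group boundary, consistent with the high group-level Jaccard reported above.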