Qwen's attention gating research just won NeurIPS 2025's best paper award, and for good reason. Their systematic study shows how a relatively simple modification, a learned gate on the attention output, can solve some of transformer training's biggest headaches: instability and scaling limitations. The "little trick" framing undersells what could be a foundational improvement for large model training.
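For the curious, here's a minimal sketch of what the gated variant looks like in PyTorch, assuming the gate is a sigmoid over a linear projection of the same input, applied elementwise to the attention output before the final projection. The module name and exact gate placement here are illustrative, not Qwen's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Multi-head self-attention with an elementwise sigmoid output gate.

    Minimal sketch of the gated-attention idea: the gate is computed from
    the same input as Q/K/V and applied to the attention output before the
    final projection. Shapes and names are illustrative.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)  # pre-sigmoid gate values
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, d_head) for per-head attention
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        # the "little trick": elementwise sigmoid gate on the attention output
        gated = attn * torch.sigmoid(self.gate(x))
        return self.out(gated)
```

The whole modification amounts to one extra linear layer and a sigmoid per attention block, which is why the parameter and compute overhead is negligible next to the stability gains the paper reports.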