2022-2-27: Flash, Expert Choice Routing, Effective MoE, Merging inputs and tokens
Mixture-of-Experts with Expert Choice Routing
Instead of choosing the top-k experts for each token, you choose the top-k tokens per expert. Seems to work even better. I actually started coding this independently last month (scooped!), and the subtleties are: 1) it makes your routing function super cheap, which is great, but 2) you end up summing different numbers of activation tensors for each token, which is hard to make efficient. You can
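To make the two subtleties concrete, here is a minimal pure-Python sketch of the idea. It is not the paper's implementation or the code mentioned above: the function names (`expert_choice_route`, `combine`), the softmax-over-experts gate, and the toy experts are all assumptions chosen for illustration. It shows that routing is just a per-expert top-k over a score column, and that the combine step ends up summing zero, one, or several expert outputs per token.

```python
import math

def softmax(row):
    # Numerically stable softmax over one token's expert logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def expert_choice_route(logits, capacity):
    """logits[t][e] is the router score of token t for expert e.

    Returns one list per expert holding (token_index, gate) pairs for the
    `capacity` tokens that expert picked. Assumption: gates come from a
    softmax over experts for each token.
    """
    probs = [softmax(row) for row in logits]
    num_tokens, num_experts = len(probs), len(probs[0])
    assignments = []
    for e in range(num_experts):
        # Each expert ranks every token by affinity and keeps its top-k.
        ranked = sorted(range(num_tokens), key=lambda t: probs[t][e], reverse=True)
        assignments.append([(t, probs[t][e]) for t in ranked[:capacity]])
    return assignments

def combine(tokens, assignments, experts):
    """Scatter gate-weighted expert outputs back to their tokens."""
    outputs = [[0.0] * len(tokens[0]) for _ in tokens]
    counts = [0] * len(tokens)  # how many experts contributed to each token
    for expert_fn, chosen in zip(experts, assignments):
        for t, gate in chosen:
            y = expert_fn(tokens[t])
            outputs[t] = [o + gate * v for o, v in zip(outputs[t], y)]
            counts[t] += 1
    return outputs, counts

# Toy example: 5 tokens, 3 experts, each expert takes 2 tokens.
logits = [[4, 0, 0], [3, 0, 3], [0, 4, 0], [0, 3, 3], [1, 1, 0]]
tokens = [[1.0], [2.0], [3.0], [4.0], [5.0]]
experts = [  # hypothetical stand-ins for expert MLPs
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]

assignments = expert_choice_route(logits, capacity=2)
outputs, counts = combine(tokens, assignments, experts)
print(counts)  # [1, 2, 1, 2, 0]
```

Every expert processes exactly `capacity` tokens, so load balance comes for free, but the per-token counts are ragged: tokens 1 and 3 sum two expert outputs, and token 4 gets none. That raggedness is the part that is awkward to vectorize.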