2022-3-6: N:M sparse attention, Rethinking demonstrations, Shift instead of attention
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models
Factorize the experts and reuse the biggest matrix in the factorization across all the experts. Maybe outperforms Switch Transformer-style MoE when grafted onto GPT-2? At the reported operating point it uses about 1.25x more params overall, whereas the Switch-like baseline used 4.67x. Seems to do better than regular MoE on WikiText-2 perplexity.
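To make the idea concrete, here's a minimal PyTorch sketch of the shared-factor trick: each expert is a small expert-specific matrix composed with one large matrix shared by all experts, so adding experts only adds the small factors. The class, parameter names, top-1 router, and dimensions are my own illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedFactorMoE(nn.Module):
    """Sketch of a factorized MoE FFN where the big up-projection factor
    is shared across experts and only a small per-expert factor is added."""
    def __init__(self, d_model=768, d_ff=3072, rank=64, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # top-1 routing (assumption)
        # One big factor (rank x d_ff), reused by every expert
        self.shared_up = nn.Parameter(torch.randn(rank, d_ff) * 0.02)
        # Small per-expert factors (d_model x rank each)
        self.expert_down = nn.Parameter(torch.randn(num_experts, d_model, rank) * 0.02)
        self.out_proj = nn.Linear(d_ff, d_model)

    def forward(self, x):  # x: (num_tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)       # (num_tokens,)
        y = torch.empty(x.shape[0], self.shared_up.shape[1], device=x.device)
        for e in range(self.expert_down.shape[0]):       # loop for clarity, not speed
            mask = expert_idx == e
            if mask.any():
                # expert e = (small per-expert matrix) @ (shared big matrix)
                y[mask] = x[mask] @ self.expert_down[e] @ self.shared_up
        return self.out_proj(F.gelu(y))

x = torch.randn(16, 768)
print(SharedFactorMoE()(x).shape)  # torch.Size([16, 768])
```

With these toy numbers, each extra expert costs only d_model x rank parameters instead of a full d_model x d_ff matrix, which is where the ~1.25x vs. ~4.67x parameter gap above comes from.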