2022-12-4 arXiv roundup: New best MoE implementation, 3x faster transformer inference
dblalock.substack.com
This newsletter made possible by MosaicML.

MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

They cast MoE modules as block-sparse operations and use this framing to speed up MoE by a lot in the num_experts > num_devices regime. More precisely, they point out that you can look at an MoE layer as an activation-sparse matmul. Each expert corresponds to a group of columns that are nonzero together, and top-k routing means that only k groups of columns are nonzero for a given token.
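To make the "activation-sparse matmul" view concrete, here is a minimal pure-Python sketch. It is not the MegaBlocks implementation. It assumes each expert is a two-layer ReLU MLP, sums the selected experts without router weights, and uses made-up toy sizes and helper names (`matmul`, `top_k`, `moe_per_expert`, `moe_as_sparse_matmul`). It checks that running each token through its top-k experts gives the same output as one wide matmul whose hidden activations are zeroed outside the selected experts' column groups.

```python
import random

def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def top_k(scores, k):
    """Indices of the k largest router scores for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def rand(rows, cols):
    """Random matrix with entries in [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def moe_per_expert(x, w1, w2, routes):
    """Reference: run each token through its selected experts and sum the results."""
    out = []
    for tok, experts in zip(x, routes):
        acc = [0.0] * len(tok)
        for e in experts:
            h = [max(v, 0.0) for v in matmul([tok], w1[e])[0]]  # ReLU
            y = matmul([h], w2[e])[0]
            acc = [a + b for a, b in zip(acc, y)]
        out.append(acc)
    return out

def moe_as_sparse_matmul(x, w1, w2, routes):
    """Same computation as one wide matmul with masked column groups."""
    n_experts, d_model, d_ff = len(w1), len(w1[0]), len(w1[0][0])
    # Concatenate experts: w1_all is (d_model, E*d_ff), w2_all is (E*d_ff, d_model).
    w1_all = [sum((w1[e][i] for e in range(n_experts)), []) for i in range(d_model)]
    w2_all = [row for e in range(n_experts) for row in w2[e]]
    hidden = matmul(x, w1_all)
    for t, experts in enumerate(routes):
        for c in range(len(hidden[t])):
            # Column c belongs to expert c // d_ff; zero it unless that expert was routed to.
            keep = (c // d_ff) in experts
            hidden[t][c] = max(hidden[t][c], 0.0) if keep else 0.0
    return matmul(hidden, w2_all), hidden

random.seed(0)
n_tokens, d_model, d_ff, n_experts, k = 4, 3, 2, 4, 2  # toy sizes (assumed)
x = rand(n_tokens, d_model)
w1 = [rand(d_model, d_ff) for _ in range(n_experts)]
w2 = [rand(d_ff, d_model) for _ in range(n_experts)]
router = rand(d_model, n_experts)
routes = [top_k(scores, k) for scores in matmul(x, router)]

dense, hidden = moe_as_sparse_matmul(x, w1, w2, routes)
ref = moe_per_expert(x, w1, w2, routes)
```

The sketch computes every column and then throws most of them away, which is exactly the waste a block-sparse kernel avoids by only touching the nonzero column groups.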