The pre-gated MoE results probably only *look* good because the baseline implementation is slow to begin with. Their baseline for switch-base on a PCIe A100-80G is 120~150 generated tokens/s at batch size 1 (roughly 6.7 to 8.3 ms per token), which is a bit pathetic for ~200M activated params these days; it's comparable to the tg128 speed of 4-bit quantized llama-7b.
The PCIe cost of loading experts stays the same no matter how efficiently the FLOPs are used, so with a properly optimized baseline the share of per-token latency spent on expert loading should get far worse.
Maybe it will work better with Grace Hopper's bridged memory.
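
To make the PCIe argument concrete, here is a rough back-of-the-envelope sketch. It is a toy model, not a measurement: the 120 tokens/s figure is the baseline quoted above, but `LOAD_MS` (the per-token expert-loading time) is a made-up number, and the model assumes prefetching either overlaps perfectly with compute or not at all.

```python
def per_token_latency_ms(compute_ms: float, load_ms: float, overlapped: bool = True) -> float:
    """Per-token latency under a toy model.

    overlapped=True: expert loading is prefetched and fully hidden behind
    compute, so the slower of the two sets the pace.
    overlapped=False: loading and compute run back to back.
    """
    if overlapped:
        return max(compute_ms, load_ms)
    return compute_ms + load_ms


def overhead_vs_gpu_only(compute_ms: float, load_ms: float, overlapped: bool = True) -> float:
    """Relative slowdown compared with keeping every expert resident on the GPU."""
    return per_token_latency_ms(compute_ms, load_ms, overlapped) / compute_ms - 1.0


BASELINE_TOK_S = 120.0                 # lower end of the baseline quoted above
BASELINE_MS = 1000.0 / BASELINE_TOK_S  # per-token compute time of that baseline
LOAD_MS = 4.0                          # HYPOTHETICAL PCIe expert-load time per token

# Speed up the compute side only; the PCIe load time does not shrink with it.
for speedup in (1, 2, 4, 8):
    compute_ms = BASELINE_MS / speedup
    overhead = overhead_vs_gpu_only(compute_ms, LOAD_MS)
    print(f"{speedup}x faster compute: {compute_ms:.2f} ms compute, "
          f"offloading overhead {overhead:.0%}")
```

With a slow baseline the load hides entirely behind compute and offloading looks nearly free; once compute time drops below the load time, the transfer becomes the floor on per-token latency. The actual crossover depends on the real expert size and PCIe throughput, which this sketch does not try to estimate.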