2 Comments

Nice post!

On MosaicBERT, don't the cost reductions basically come from HW improvements? By my calculations, it seems like MosaicBERT was trained on more FLOP than BERT-Base: I count ~1.6e18 FLOP (F32, or ~3e18 with F16) vs ~1.2e18 FLOP (F32), and that's assuming the same HW utilization, even though I'm guessing the A100 is better on that front than the v2 TPU (though my calculations could definitely be off). Or am I misunderstanding something here?
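
For anyone following the arithmetic: a hardware-based estimate like this is usually peak throughput × number of devices × wall-clock time × utilization. Here is a minimal sketch of that calculation. The function name and the inputs are hypothetical placeholders, not the actual MosaicBERT or BERT-Base figures, and the real numbers depend on the precision and utilization you assume.

```python
def estimate_training_flop(peak_flop_per_s, n_devices, hours, utilization=1.0):
    """Rough hardware-based estimate of total training FLOP.

    peak_flop_per_s: per-device peak throughput at the assumed precision
    n_devices:       number of accelerators used
    hours:           wall-clock training time
    utilization:     fraction of peak actually achieved, in (0, 1]
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    seconds = hours * 3600.0
    return peak_flop_per_s * n_devices * seconds * utilization


# Placeholder inputs for illustration only, not measured values.
example = estimate_training_flop(peak_flop_per_s=100e12, n_devices=8, hours=1.0)
print(f"{example:.2e} FLOP")
```

Assuming the same utilization for both systems, as in the comparison above, reduces the ratio of the two estimates to the product of peak throughput, device count, and time.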


Part of it is hardware, especially vs the original paper. But we ablated each of the other changes and all of them helped. We also had the fastest training on the same hardware a few months ago (as measured by MLPerf: https://www.mosaicml.com/blog/mlperf-nlp-nov2022), and MosaicBERT has gotten faster since then.
