Mar 15·edited Mar 15

Nice post!

On MosaicBERT, don't the cost reductions basically come from hardware improvements? By my calculations, MosaicBERT was actually trained on more FLOP than BERT-Base: I count ~1.6e18 FLOP (F32, or ~3e18 with F16) vs. ~1.2e18 FLOP (F32). And that's assuming the same hardware utilization for both, even though I'd guess the A100 does better on that front than the TPU v2. My calculations could definitely be off, though. Or am I misunderstanding something here?
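For reference, the kind of back-of-the-envelope estimate I mean looks roughly like the sketch below. The function name and every input value are placeholders I made up for illustration. They are not the actual MosaicBERT or BERT-Base chip counts, peak throughputs, or training times, so plug in the real ones before drawing conclusions.

```python
def estimate_training_flop(num_chips, peak_flop_per_s, hours, utilization):
    """Hardware-based estimate of total training compute.

    total FLOP = chips * peak FLOP/s per chip * seconds trained * utilization
    """
    seconds = hours * 3600.0
    return num_chips * peak_flop_per_s * seconds * utilization


# Hypothetical inputs, NOT the real training setups.
# Both runs assume the same utilization, as in the comparison above.
run_a = estimate_training_flop(num_chips=8, peak_flop_per_s=1e14,
                               hours=1.0, utilization=0.5)
run_b = estimate_training_flop(num_chips=16, peak_flop_per_s=4e13,
                               hours=1.0, utilization=0.5)

print(f"run A: {run_a:.2e} FLOP")
print(f"run B: {run_b:.2e} FLOP")
print(f"ratio A/B: {run_a / run_b:.2f}")
```

The estimate scales linearly with the peak throughput you assume, so picking the F32 vs. F16 peak for the same chip changes the result by the ratio of the two peaks. Likewise, if one chip really does get better utilization, the equal-utilization assumption understates its actual compute.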
