2023-3-12 arxiv roundup: Pretraining BERT for…

Davis Blalock

Mar 15, 2023

This newsletter made possible by MosaicML.

2 Comments

Simultan

Mar 15, 2023 (edited)

Nice post!

On MosaicBERT, don't the cost reductions basically come from HW improvements? By my calculations, MosaicBERT was trained with more FLOP than BERT-Base: I count ~1.6e18 FLOP (F32, or ~3e18 with F16) vs. ~1.2e18 FLOP (F32). That assumes the same HW utilization for both, even though I'm guessing the A100 is better on that front than the v2 TPU (though my calculations could definitely be off). Or am I misunderstanding something here?

Davis Blalock

Mar 15, 2023

Part of it is hardware, especially vs the original paper. But we ablated each of the other changes and all of them helped. We also had the fastest training on iso hardware a few months ago (as measured by MLPerf https://www.mosaicml.com/blog/mlperf-nlp-nov2022), and MosaicBERT has gotten faster since then.

Davis Summarizes Papers
