7 Comments

Davis, I've loved reading your summaries and appreciate this latest on DBRX. However, I am wondering why you included this "Since OpenAI had GPT-3 before we even existed as a company, this indicates that the gap between the leaders and others is narrowing (as measured in time, not necessarily model capabilities)." Databricks has been around a lot longer than either OpenAI or GPT3 (2013), thus this note leaves me a bit confused.

Expand full comment
author

Sorry, I meant "we" as in MosaicML (now the databricks GenAI team). Good catch.

Expand full comment

Finally! You’re back! Now I can keep up with arxiv again :)

Expand full comment

Is qk normalization refer to qk-layernorm? or something others?

And just wondering why qk clipping rather than qk normalization?

Expand full comment
author

Yes, qk layernorm. Clipping operates elementwise and can be more easily fused into other kernels.

Expand full comment

I'm new here. Nice newsletter, thanks for writing it!

Can you say more about why you typically prefer compute time held constant? Seems unintuitive to me.

> For maybe the first time in my life, I wish someone had held FLOPs constant instead of compute time.

Expand full comment
author

FLOPs are extremely easy to game because they don't take into account memory bandwidth consumption, communication, or other system bottlenecks. You can trivially jack up your param count vs accuracy numbers by throwing in a bunch of bandwidth-bound ops like grouped convolutions. A large fraction of purported gains in new architectures are just people trading speed for accuracy using bandwidth-bound operations but not thinking of it that way (e.g., fancy spatial mixing/attention schemes in many vision models).

In contrast, wall time is intrinsically useful (people don't want to wait) and closely tied to cost (either through direct hourly billing or hardware depreciation). The main way it fails is when there's implementation bias, in which one algorithm appears better than another but the results are confounded by implementation quality/effort.

Expand full comment