Discussion about this post

Sherman

Re: sparse backprop, do you not find it odd that they report their Switch Transformer baseline as underperforming the dense baseline in BLEU score under all circumstances?

I don't believe this is the expected outcome based on the original Switch Transformers paper.

Victualis

Regarding the Downscaling paper, the authors' take is weird. My conclusion from their work is that downscaling an LLM preserves its learning ability even though it scrubs facts. This seems super useful if one wants to stop a system from regurgitating training-set data before fine-tuning it on a different training set that it is OK to remember. In other words: build a large model with lots of low-quality data, make it smaller, then finish training the small model on highly curated data. The point is to use an existing system as the basis, which skips the initial training phase of the new system and lets large training runs be reused instead of starting from a random set of weights.
