Re: sparse backprop, do you not find it odd that they report their Switch Transformer baseline as underperforming the dense model baseline in BLEU score under all circumstances?
I don't believe this is the expected outcome based on the original Switch Transformers paper.
Regarding the Downscaling paper, the authors' take is weird. My conclusion from their work is that downscaling LLMs retains learning ability even though it scrubs facts. This seems super useful if one wants to stop a system regurgitating training set data, prior to finetuning it on a different training set which it is OK to remember. In other words: build a large model with lots of low quality data, make it smaller, then finish training the small model on highly curated data. The point here is to use an existing system as the basis, saving the initial training phase of the new system, and allowing reuse of large training runs instead of starting from a random set of weights.
I'm delighted that you're still able to write this. I miss the weekly editions, but understand that you have a new life after the acquisition. You deserve the rewards!
Really niche comment on the RLHF length paper: we did A TON of work at HuggingFace trying to understand why running SFT on longer sequences improves a lot of benchmarks. It never really became clear, but we reproduced it on a lot of different datasets. We think it was why LIMA performed so well too.
I don't think it's safe to say it's the same as running RLHF. On the things our evals are stable for, the two match; overall, not really.