2023-10-16 arXiv roundup: Cornucopia of easy…

Oct 17, 2023

Also, I was on the AI Stories podcast!

4 Comments

Oct 22, 2023

Re: sparse backprop, do you not find it odd that they report their Switch Transformer baseline as underperforming the dense model baseline in BLEU score under all circumstances?

I don't believe this is the expected outcome based on the original Switch Transformers paper.


Victualis

Oct 18, 2023

Regarding the Downscaling paper, the authors' take is weird. My conclusion from their work is that downscaling LLMs retains learning ability even though it scrubs facts. This seems super useful if one wants to stop a system from regurgitating training set data before finetuning it on a different training set that it is OK to remember. In other words: build a large model with lots of low-quality data, make it smaller, then finish training the small model on highly curated data. The point here is to use an existing system as the basis, which saves the initial training phase of the new system and allows reuse of large training runs instead of starting from a random set of weights.


IJCAI 2023

Oct 18, 2023

I'm delighted that you're still able to write this. I miss the weekly editions, but understand that you have a new life after the acquisition. You deserve the rewards!


Nathan Lambert

Oct 18, 2023

Really niche comment on the RLHF length paper: we did A TON of work at HuggingFace trying to understand why running SFT on longer sequences improves a lot of benchmarks. It never really became clear, but we reproduced it on a lot of different datasets. We think it was why LIMA performed so well too.

I don't think it's safe to say it's the same as running RLHF. On the things our evals are stable for, the two match; overall, not really.


Davis Summarizes Papers
