Discussion about this post

subway

Nice work! I have some more comments.

"First, running steps with a learning rate of 0 still decreases their predicted loss. "

-- They discuss this issue in Section 6.3, and it could be resolved by adding an LR-weighted term to S2.

"Second, if I’m thinking through it correctly, the best learning rate schedule according to their formula should be just the highest LR that doesn’t diverge for most of training followed by an instant drop to ~0."

-- No. As shown in Figure 11 of their paper, the best learning rate schedule is the highest LR that doesn't diverge for most of training, followed by an anneal to ~0 over roughly the final 20% of steps. This conclusion is very similar to previous work such as the WSD scheduler and Hägele et al. (https://arxiv.org/abs/2405.18392).
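
For concreteness, here's a minimal sketch of that kind of schedule: hold the peak LR for most of training, then anneal to ~0 over the final ~20% of steps. The peak LR value, the 20% decay fraction, and the linear decay shape are illustrative assumptions, not values taken from the paper.

```python
def wsd_style_lr(step: int, total_steps: int,
                 peak_lr: float = 3e-4, decay_frac: float = 0.2) -> float:
    """Constant peak LR, then a linear anneal to ~0 over the last `decay_frac` of steps."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < decay_start:
        return peak_lr
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr * (1.0 - progress)

# Example: with 10,000 total steps, the LR stays at 3e-4 until step 8,000,
# then falls linearly to 0 by step 10,000.
```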

Devansh

Not sure what happened, but Substack unsubbed me from your newsletter.

Might want to check if that's happened with other people.
