"First, running steps with a learning rate of 0 still decreases their predicted loss. "
-- They discussed this issue in section 6.3 and this issue could be resolved by adding a LR-weighted item to S2.
"Second, if I’m thinking through it correctly, the best learning rate schedule according to their formula should be just the highest LR that doesn’t diverge for most of training followed by an instant drop to ~0."
-- No, as shown in Figure 11 in their paper, the best learning rate schedule should be be just the highest LR that doesn’t diverge for most of training followed by a ~20% annealing steps drop to ~0. The conclusion is very similar as previous work like WSD scheduler and Hägele's work (https://arxiv.org/abs/2405.18392).
Thanks for the high quality comment! I've edited the description to call attention to their mitigation for the 0 learning rate issue.
Regarding the second point: I see that this is why they and the Hagele paper found empirically, but is there any math showing that this is what we'd expect from the present paper's formula? I didn't work through the math but I think (?) the game-the-formula solution is the abrupt dropoff I mentioned. Maybe I didn't read carefully enough, but Fig 11 assumes that annealing is happening and just sweeps the fraction of training--it doesn't consider the option of dropping to the minimum value immediately.
Yes, math decides everything and it does consider the situation you mentioned. Let me try to explain this.
Firstly, the option of dropping to the minimum value immediately is exactly considered, which is seen as "1-step annealing", and Figure 11 has shown the bad results of very small-step annealing.
Secondly, From the math perspective, momentum needs accumulation (just like a high speed of a ball needs a long slope). The equation 6 shows that S2 is based on the accmulation of all momentums.
For example, lr anneals 1->0 in only one step, the momentum will be 1 and S2 will be 1.
If lr 1->0 in 10 steps, momentum of each step will be
step 1: 0.1
steo 2: 0.1 + 0.1*f,
step 3: 0.1 + 0.1*f + 0.1*f^2
...
step 10: 0.1 + 0.1*f + 0.1*f^2 +... + 0.1*f^9
where f is the decay factor.
S2 is the sum of the momentums, which is obviously much larger than the 1-step annelaing situation.
Or more intuitively, the blue part in figure 1 will be very, very small if you do 1-step annealing. In summary, as the author stated, the art of LRS is the balance between forward and annealing area. 1-step annealing just breaks the balance.
PS: In section 6.1, the authors discussed that the optimal annealing step is very close to the momentum window size, that is 1/(1-f).
Ah, clarification: I'm not suggesting that one-step annealing *as the last step* should be the best. I'm suggesting that annealing in one step *instead of gradually* games the formula better. So you still have, say, 20% of training at a lower learning rate; it's just that it's a step function drop right at the 80% mark, followed by a fixed, small lr, instead of a smooth decay over the last 20%.
This approach maximizes the sum of the momentums (blue area) subject to the constraints that:
1) We don't drop the lr below 0 and
2) the lr has to stay fixed until a certain step
You can see this by considering the impulse response from a change in lr at any time step. Each change creates a geometric series of momentum terms (as you identify), and earlier changes have more terms in the series than later changes.
However, I'm not certain that this sudden drop minimizes their *overall* loss formula. Eyeballing it, I think you're better off doing a sudden drop and just tuning when it happens, but really what we need is a proof.
Yes, the situation you mentioned has a bit larger S2 than annealing gradually. But S1 of your mentioned situation can be much smaller than annealing gradually. So I tend to believe that their "overall" loss would be worse in your situation. I agree that what we need is a proof when something tradeoff between S1 and S2 happens.
I love these papers and the implications. I keep coming back to the value of discriminators. Many papers have proposed majority vote for best completions. Have any papers proposed majority vote for many completions x many judges? Of course, this introduces combinatoric scaling of test time compute, but if test time inference is ~free, this seems likely to extend the scaling laws, perhaps substantially.
I personally haven't seen it but have also been out of the loop for a few months. The multiple judges angle is interesting and makes me wonder what the optimal combination of [pseudo-]ensembling techniques for a fixed generator.
Nice work! I have some more comments.
"First, running steps with a learning rate of 0 still decreases their predicted loss. "
-- They discussed this issue in section 6.3, and it could be resolved by adding an LR-weighted term to S2.
"Second, if I’m thinking through it correctly, the best learning rate schedule according to their formula should be just the highest LR that doesn’t diverge for most of training followed by an instant drop to ~0."
-- No. As shown in Figure 11 of their paper, the best learning rate schedule is the highest LR that doesn't diverge for most of training, followed by a drop to ~0 over roughly the last 20% of steps. This conclusion is very similar to prior work such as the WSD scheduler and Hägele et al. (https://arxiv.org/abs/2405.18392).
Thanks for the high-quality comment! I've edited the description to call attention to their mitigation for the 0-learning-rate issue.
Regarding the second point: I see that this is what they and the Hägele paper found empirically, but is there any math showing that this is what we'd expect from the present paper's formula? I didn't work through the math, but I think (?) the game-the-formula solution is the abrupt dropoff I mentioned. Maybe I didn't read carefully enough, but Fig 11 assumes that annealing is happening and just sweeps the fraction of training; it doesn't consider the option of dropping to the minimum value immediately.
Yes, the math decides everything, and it does cover the situation you mentioned. Let me try to explain.
First, the option of dropping to the minimum value immediately is considered: it is the "1-step annealing" case, and Figure 11 shows the bad results of very-small-step annealing.
Second, from the math perspective, momentum needs accumulation (just as a ball needs a long slope to reach high speed). Equation 6 shows that S2 is based on the accumulation of all the momentums.
For example, if the lr anneals 1->0 in a single step, the momentum will be 1 and S2 will be 1.
If the lr anneals 1->0 over 10 steps, the momentum at each step will be
step 1: 0.1
step 2: 0.1 + 0.1*f
step 3: 0.1 + 0.1*f + 0.1*f^2
...
step 10: 0.1 + 0.1*f + 0.1*f^2 + ... + 0.1*f^9
where f is the decay factor.
S2 is the sum of the momentums, which is clearly much larger than in the 1-step annealing situation.
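This accumulation argument is easy to check numerically. A minimal sketch, assuming the momentum update m_t = f*m_{t-1} + (lr_{t-1} - lr_t) and S2 = sum of the m_t, which is my reading of their Equation 6 (the exact form in the paper may differ):

```python
# Sketch of S2 (sum of accumulated momentum) under two annealing schedules.
# Assumes m_t = f * m_{t-1} + (lr_{t-1} - lr_t) and S2 = sum(m_t) --
# one reading of the paper's Equation 6, not its exact formula.

def s2(lrs, f=0.9):
    m, total = 0.0, 0.0
    for prev, cur in zip(lrs, lrs[1:]):
        m = f * m + (prev - cur)  # momentum accumulates each lr drop
        total += m
    return total

one_step = [1.0] * 10 + [0.0]                  # drop 1 -> 0 at the very last step
ten_step = [1.0 - 0.1 * i for i in range(11)]  # anneal 1 -> 0 over 10 steps

print(s2(one_step))  # 1.0: a single momentum term before training ends
print(s2(ten_step))  # ~4.14: ten overlapping (truncated) geometric series
```

With the gradual schedule each of the ten 0.1-drops spawns its own decaying series, and the sum of all of them dwarfs the single term left by a last-step drop.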
Or, more intuitively: the blue part in Figure 1 will be very, very small if you do 1-step annealing. In summary, as the authors state, the art of the LRS is the balance between the forward and annealing areas, and 1-step annealing simply breaks that balance.
PS: In section 6.1, the authors note that the optimal number of annealing steps is very close to the momentum window size, i.e. 1/(1-f).
Ah, clarification: I'm not suggesting that one-step annealing *as the last step* should be the best. I'm suggesting that annealing in one step *instead of gradually* games the formula better. So you still have, say, 20% of training at a lower learning rate; it's just that it's a step function drop right at the 80% mark, followed by a fixed, small lr, instead of a smooth decay over the last 20%.
This approach maximizes the sum of the momentums (blue area) subject to the constraints that:
1) We don't drop the lr below 0 and
2) the lr has to stay fixed until a certain step
You can see this by considering the impulse response from a change in lr at any time step. Each change creates a geometric series of momentum terms (as you identify), and earlier changes have more terms in the series than later changes.
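Spelling that impulse response out (under the same assumed update, where a unit drop in the lr at step j feeds a decaying term f^(t-j) into the momentum at every later step t):

```latex
% Contribution of a unit lr drop at step j to S2, accumulated up to the final step T:
\sum_{t=j}^{T} f^{\,t-j} \;=\; \frac{1 - f^{\,T-j+1}}{1 - f}
% This grows as j shrinks: earlier drops leave more surviving terms in S2.
```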
However, I'm not certain that this sudden drop minimizes their *overall* loss formula. Eyeballing it, I think you're better off doing a sudden drop and just tuning when it happens, but really what we need is a proof.
Yes, the situation you mention has a somewhat larger S2 than annealing gradually. But its S1 can be much smaller than with gradual annealing, so I suspect the "overall" loss would be worse in your situation. I agree that a proof is what we need whenever there is a tradeoff between S1 and S2.
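The tradeoff is visible in the same toy setup. A sketch, again assuming m_t = f*m_{t-1} + (lr_{t-1} - lr_t), S2 = sum(m_t), and S1 = the sum of per-step lrs (the "forward area"); these are my readings of the paper's quantities, not its exact definitions, and the sudden drop goes to 0 rather than a small fixed lr for simplicity:

```python
# S1/S2 tradeoff over the same 10-step window: sudden early drop vs
# gradual annealing. Assumes S1 = sum of per-step lrs ("forward area"),
# m_t = f * m_{t-1} + (lr_{t-1} - lr_t), S2 = sum(m_t) -- one reading
# of the paper's formulas, not their exact definitions.

def s1_s2(lrs, f=0.9):
    m, s2 = 0.0, 0.0
    for prev, cur in zip(lrs, lrs[1:]):
        m = f * m + (prev - cur)
        s2 += m
    return sum(lrs[1:]), s2  # S1 counts the steps actually run at each lr

sudden = [1.0] + [0.0] * 10                   # step-function drop, then hold
gradual = [1.0 - 0.1 * i for i in range(11)]  # smooth decay over the window

print(s1_s2(sudden))   # S1 = 0.0, S2 ~ 6.51: larger blue area...
print(s1_s2(gradual))  # S1 = 4.5, S2 ~ 4.14: ...but far more forward area
```

So the sudden drop does win on S2, exactly as conceded above, while giving up all of the forward area in the window; which schedule wins overall depends on the fitted loss coefficients.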
Not sure what happened, but Substack unsubbed me from your newsletter.
Might want to check if that's happened with other people.
I love these papers and the implications. I keep coming back to the value of discriminators. Many papers have proposed majority vote over best completions. Have any papers proposed majority vote over many completions x many judges? Of course, this introduces combinatorial scaling of test-time compute, but if test-time inference is ~free, this seems likely to extend the scaling laws, perhaps substantially.
I personally haven't seen it, but I've also been out of the loop for a few months. The multiple-judges angle is interesting and makes me wonder what the optimal combination of [pseudo-]ensembling techniques is for a fixed generator.