2021-12-4: Sparse is Enough in Scaling Transformers, Sparse ImageNet transfer
Adaptive Optimization with Examplewise Gradients
They try to exploit per-sample gradient information (rather than the usual per-batch gradients averaged across samples) to improve Adam. Seems to be a negative result so far.
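To make the per-sample vs per-batch distinction concrete, here is a minimal pure-Python sketch. It is not the paper's method; the toy model (y_hat = w * x with squared-error loss), the data, and all function names are assumptions chosen for illustration. It shows what information the batch average throws away: the spread of gradients across examples.

```python
# Toy model: y_hat = w * x, per-example loss 0.5 * (w * x - y) ** 2.
# Everything here is illustrative, not taken from the paper.

def per_example_grads(w, xs, ys):
    """Gradient of each example's loss with respect to w."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]

def mean(values):
    return sum(values) / len(values)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 7.0, 7.0]
w = 1.0

grads = per_example_grads(w, xs, ys)

# What a standard optimizer sees: one averaged gradient per batch.
batch_grad = mean(grads)

# A second-moment estimate built from the batch gradient alone.
square_of_mean = batch_grad ** 2

# A second-moment estimate built from the individual gradients.
mean_of_squares = mean([g * g for g in grads])

# The gap between the two is the across-example gradient variance,
# which is invisible once gradients have been averaged.
grad_variance = mean_of_squares - square_of_mean

print(grads, batch_grad, square_of_mean, mean_of_squares, grad_variance)
```

In a real framework you would get the per-example gradients from something like a vectorized gradient transform rather than a hand-written derivative; the point here is only that the mean of squares and the square of the mean differ.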
How Well Do Sparse ImageNet Models Transfer?
"In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups."
On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective.
Some math + CIFAR evidence that larger batches converge to sharper minima (though not if you train infinitely long; they eventually catch up).
How Smart Guessing Strategies Can Yield Massive Scalability Improvements for Sparse Decision Tree Optimization
They train sparse (i.e., smaller, more interpretable) decision trees with something resembling distillation + some intuitive heuristics. Tree-based models are less popular to study than deep learning, but extremely important in practice, so this could be pretty valuable.
⭐ Sparse is Enough in Scaling Transformers
Google paper that gets no training-time improvement, but up to 37x single-sequence CPU inference speedup at iso-accuracy. Not sure how it works because it wasn't easy to skim. Probably the most interesting/meaty paper this week.