2021-12-4: Sparse is Enough in Scaling Transformers, Sparse ImageNet transfer

May 03, 2022

Adaptive Optimization with Examplewise Gradients

They try to exploit per-sample gradient information (rather than the usual per-batch gradients averaged across samples) to improve Adam. Seems to be a negative result so far.
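To make the per-sample vs. per-batch distinction concrete, here's a toy sketch. It is not the paper's method; the linear model, the data, and the function names are all made up for illustration. It just shows what the batch average throws away: the square of the averaged gradient (what an Adam-style second-moment estimate sees) can differ a lot from the average of the squared per-example gradients.

```python
# Toy linear model with squared loss: loss_i(w) = (w . x_i - y_i)^2
# All names and data here are hypothetical, chosen only for illustration.

def per_example_grads(w, xs, ys):
    """Return one gradient vector per example (no averaging)."""
    grads = []
    for x, y in zip(xs, ys):
        residual = sum(wj * xj for wj, xj in zip(w, x)) - y
        grads.append([2.0 * residual * xj for xj in x])
    return grads

def mean_over_examples(vectors):
    """Coordinate-wise average over a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

w = [0.0, 0.0]
xs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
ys = [1.0, 1.0, 1.0, 1.0]

grads = per_example_grads(w, xs, ys)

# What a normal optimizer sees: the batch-averaged gradient.
g_bar = mean_over_examples(grads)
sq_of_mean = [g * g for g in g_bar]

# What per-example information adds: the mean of squared gradients.
mean_of_sq = mean_over_examples([[g * g for g in gi] for gi in grads])

print("batch-mean gradient:        ", g_bar)
print("square of batch mean:       ", sq_of_mean)
print("mean of per-example squares:", mean_of_sq)
```

On this data the first two examples pull the first weight in opposite directions, so their gradients cancel in the batch mean (square of the mean is 0.0 for that coordinate) even though the per-example second moment is 2.0. That kind of disagreement between samples is invisible to an optimizer that only ever sees the average.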

How Well Do Sparse Imagenet Models Transfer?

"In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups."

On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective

Some math + CIFAR evidence that larger batches converge to sharper minima (though not in the limit of infinitely long training, where they eventually catch up).

How Smart Guessing Strategies Can Yield Massive Scalability Improvements for Sparse Decision Tree Optimization

They train sparse (i.e., smaller, more interpretable) decision trees with something resembling distillation + some intuitive heuristics. Tree-based models are less popular to study than deep learning, but extremely important in practice, so this could be pretty valuable.

⭐ Sparse is Enough in Scaling Transformers

Google paper getting no training-time improvement, but up to 37x single-sequence CPU inference speedup at iso-accuracy. Not sure how it works because it wasn't easy to skim. Probably the most interesting/meaty paper this week.

Davis Summarizes Papers
