2022-1-30: Xformers, ConvMixer, Megatron-Turing NLG 530B
Convolutional Xformers for Vision Not sure what to make of the approach's overall efficacy, but the authors report gains from 1) switching optimizers from AdamW to SGD partway through training, and 2) turning off RandAugment near the end of training. Both seem like actionable (if somewhat mysterious) optimizations.
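If you wanted to try both tricks, one way to wire them up is as a function of training progress. This is a minimal sketch: the `phase_config` helper and both switch points are hypothetical placeholders (nothing above says when the paper makes either change). In a real training loop you would build the SGD optimizer at the switch and drop RandAugment from the data pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    optimizer: str         # "adamw" or "sgd"
    use_randaugment: bool  # whether RandAugment stays in the data pipeline


def phase_config(epoch: int, total_epochs: int,
                 sgd_switch_frac: float = 0.8,
                 augment_off_frac: float = 0.9) -> PhaseConfig:
    """Return the training setup for a 0-indexed epoch.

    Both fractions are hypothetical defaults, not values from the paper.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    progress = epoch / total_epochs
    # Start with AdamW, hand over to SGD once enough of training has elapsed.
    optimizer = "sgd" if progress >= sgd_switch_frac else "adamw"
    # Keep RandAugment on until the final stretch of training.
    return PhaseConfig(optimizer, use_randaugment=progress < augment_off_frac)


if __name__ == "__main__":
    for epoch in (0, 79, 80, 89, 90, 99):
        print(epoch, phase_config(epoch, total_epochs=100))
```

Keeping the schedule in one pure function makes it easy to sweep the two switch points, which is probably the first thing you'd want to ablate.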
Weight Expansion: A New Perspective on Dropout and Generalization Increasing a quantity similar to the determinant of the weight covariance matrix seems to have a causal effect on generalization. Dropout might work by being a cheap way of obtaining this "weight expansion."
Patches Are All You Need? The ConvMixer paper: an architecture whose implementation fits in a tweet, yet it outperforms ResNet-50 and many other, more sophisticated networks.
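To get a feel for how little there is to the architecture, here is a rough parameter count written out layer by layer. The layer layout reflects my reading of the paper (patch embedding, then repeated depthwise + pointwise conv blocks, then a linear classifier). The `convmixer_param_count` helper is a made-up name, and the example configuration at the bottom is an assumption on my part.

```python
def convmixer_param_count(dim: int, depth: int, kernel_size: int,
                          patch_size: int, n_classes: int = 1000,
                          in_channels: int = 3) -> int:
    """Count learnable parameters (conv/linear weights, biases, BatchNorm affine)."""
    batchnorm = 2 * dim  # per-channel scale and shift

    # Patch embedding: conv with kernel = stride = patch_size, then GELU + BatchNorm.
    patch_embed = in_channels * patch_size ** 2 * dim + dim + batchnorm

    # Each block: residual depthwise conv -> GELU -> BN -> 1x1 conv -> GELU -> BN.
    depthwise = kernel_size ** 2 * dim + dim  # one k x k filter per channel
    pointwise = dim * dim + dim               # 1x1 conv mixes channels
    block = depthwise + batchnorm + pointwise + batchnorm

    # Global average pooling has no parameters; a linear classifier follows it.
    classifier = dim * n_classes + n_classes

    return patch_embed + depth * block + classifier


if __name__ == "__main__":
    # Assumed configuration: width 1536, depth 20, 9x9 kernels, 7x7 patches.
    total = convmixer_param_count(dim=1536, depth=20, kernel_size=9, patch_size=7)
    print(f"{total / 1e6:.1f}M parameters")
```

Almost all of the parameters sit in the 1x1 pointwise convolutions; the large depthwise kernels are comparatively cheap.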
Fast Differentiable Matrix Square Root Still not fast enough that you'd want to backprop through a matrix square root, but a nice new hammer for the mathematical toolbox.
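For context on what an iterative matrix square root looks like, here is a minimal sketch of the classical coupled Newton-Schulz iteration. It is shown only as an illustration of the problem, and I'm not claiming it is the algorithm this paper proposes. It assumes a symmetric positive definite input and uses plain Python lists so that it runs without any dependencies.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def newton_schulz_sqrt(a: Matrix, iters: int = 20) -> Matrix:
    """Approximate the principal square root of a symmetric positive definite matrix."""
    n = len(a)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    # Scale by the Frobenius norm so all eigenvalues lie in (0, 1],
    # which is inside the region where the iteration converges.
    norm = math.sqrt(sum(x * x for row in a for x in row))
    y = [[x / norm for x in row] for row in a]  # converges to sqrt(A / norm)
    z = identity                                # converges to its inverse
    for _ in range(iters):
        zy = matmul(z, y)
        # t = (3I - ZY) / 2, shared by both updates.
        t = [[0.5 * (3.0 * identity[i][j] - zy[i][j]) for j in range(n)]
             for i in range(n)]
        y, z = matmul(y, t), matmul(t, z)
    # Undo the scaling: sqrt(A) = sqrt(norm) * sqrt(A / norm).
    scale = math.sqrt(norm)
    return [[scale * x for x in row] for row in y]


if __name__ == "__main__":
    root = newton_schulz_sqrt([[2.0, 1.0], [1.0, 2.0]])
    print(matmul(root, root))  # should be close to the input matrix
```

The iteration uses only matrix multiplications, which is what makes this family of methods GPU-friendly and differentiable by ordinary autograd, at the cost of unrolling every step in the backward pass.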
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model "We present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters"
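Some back-of-the-envelope arithmetic shows why a model this size needs that much parallelism machinery. This is a sketch under stated assumptions: only the 530 billion figure comes from the abstract. The 16 bytes per parameter is the usual mixed-precision Adam accounting, and the 80 GB device size is a hypothetical choice. Activations and communication buffers are ignored entirely.

```python
import math

PARAMS = 530e9  # parameter count quoted in the abstract

BYTES_FP16 = 2
BYTES_FP32 = 4

# Assumed mixed-precision Adam layout per parameter:
# fp16 weight + fp16 gradient + fp32 master weight + fp32 momentum + fp32 variance.
TRAINING_BYTES_PER_PARAM = 2 * BYTES_FP16 + 3 * BYTES_FP32  # 16 bytes

GPU_MEMORY_BYTES = 80e9  # hypothetical 80 GB accelerator


def weights_terabytes(params: float = PARAMS) -> float:
    """Size of the fp16 weights alone, in TB (1 TB = 1e12 bytes)."""
    return params * BYTES_FP16 / 1e12


def training_state_terabytes(params: float = PARAMS) -> float:
    """Weights + gradients + optimizer state under the assumed layout, in TB."""
    return params * TRAINING_BYTES_PER_PARAM / 1e12


def min_gpus_for_state(params: float = PARAMS) -> int:
    """Lower bound on devices needed just to hold the training state."""
    return math.ceil(params * TRAINING_BYTES_PER_PARAM / GPU_MEMORY_BYTES)


if __name__ == "__main__":
    print(f"fp16 weights:   {weights_terabytes():.2f} TB")
    print(f"training state: {training_state_terabytes():.2f} TB")
    print(f"min devices:    {min_gpus_for_state()}")
```

Under these assumptions the weights alone come to about 1.06 TB and the full training state to about 8.48 TB, so at least 106 such devices are needed before a single activation is stored.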