2022-2-6: Highlights from all the ICML2022 submissions
Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods Messing with data ordering within a batch. No positive results, and such strong negative results that I'm more inclined to believe strong negative results I’ve gotten myself in the past. Maybe some interesting theory in here but didn't really look at it.
Even Simpler Deterministic Matrix Sketching The entire abstract: "This paper provides a one-line proof of Frequent Directions (FD) for sketching streams of matrices. The simpler proof arises from sketching the covariance of the stream of matrices rather than the stream itself." FD is a classic approach to basically online SVD, and Liberty is the guy who invented it. Mostly my inner approximate matmul researcher appreciates how elegant this 1-pager is.
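For anyone who hasn't seen FD before, here is a minimal NumPy sketch of the classic algorithm (the buffer-of-2*ell-rows variant), not the covariance-sketching formulation this paper proposes. The function name `frequent_directions` and the parameter `ell` are my own labels for illustration.

```python
import numpy as np

def frequent_directions(rows, ell):
    """Stream the rows of A into a sketch B with 2*ell rows (classic FD).

    Illustrative only: this is the textbook algorithm, not this paper's variant.
    """
    rows = np.asarray(rows, dtype=float)
    _, d = rows.shape
    B = np.zeros((2 * ell, d))
    next_free = 0  # index of the next empty row in the buffer
    for a in rows:
        if next_free == 2 * ell:
            # Buffer is full: rotate onto singular directions and shrink.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell] ** 2 if len(s) > ell else 0.0
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B[:] = 0.0
            B[: len(s)] = s_shrunk[:, None] * Vt
            # Rows ell and beyond are now zero, so they are free again.
            next_free = ell
        B[next_free] = a
        next_free += 1
    return B
```

The point of the shrink step is that B.T @ B stays close to A.T @ A in spectral norm while B never holds more than 2*ell rows.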
Formal Mathematics Statement Curriculum Learning OpenAI solves math olympiad problems. Not relevant right now, but unifying classical + connectionist approaches might be a powerful way to give models a ton of knowledge + inductive bias, and is something I'm keeping an eye on.
Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers They claim to beat ZeRO-Infinity, and have some optimizations it doesn't ("such as input-batch grouping and configuration search for principled layer packing."). But they only evaluate on a single node with four 1080 Ti GPUs (which predate tensor cores), so I'm not sure what to make of this. It would require a really detailed reading to properly scrutinize. I see this as an indication that there's still room for algorithmic optimization on top of / within DeepSpeed.
Deep Layer-wise Networks Have Closed-Form Weights So I'm bullish on closed-form updates, and this paper takes it further with an algorithm that directly gives closed-form expressions for the weights in an MLP. Results aren't as good as SGD and are only evaluated on small datasets, unsurprisingly, but in at least one case it got decent accuracy 1000x faster. I'm gonna need to throw some serious reading at this to fully understand it, but it might be promising as an initialization scheme.
Robust Training of Neural Networks using Scale Invariant Architectures They get vanilla SGD working less poorly on BERT by modifying the architecture to make it invariant to weight scaling. Probably not practical, but they do have a principled way of setting a gradient clipping threshold that might be useful to try next time you want to clip some gradients.
Accelerating DNN Training with Structured Data Gradient Pruning Proposes a pruning scheme compatible with Ampere sparsity. But they didn’t bother to actually wrap the CUTLASS functions in torch ops, so the speedups are just "estimated".
⭐ Unified Scaling Laws for Routed Language Models Scaling laws for MoEs. One nugget I found is that they do the sinkhorn transform on the token-expert affinity matrix and that apparently works as well as fancier balancing schemes.
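A rough sketch of what Sinkhorn balancing on a token-expert affinity matrix looks like, in NumPy. This is my own illustration under simple assumptions (dense logits, a fixed iteration count, equal load per expert), not the paper's exact routing code; `sinkhorn_balance` and `n_iters` are hypothetical names.

```python
import numpy as np

def sinkhorn_balance(logits, n_iters=50):
    """Alternately normalize rows and columns of exp(logits).

    Rows (tokens) are pushed toward summing to 1, and columns (experts)
    toward summing to n_tokens / n_experts, i.e. an even load.
    Illustrative sketch only.
    """
    n_tokens, n_experts = logits.shape
    P = np.exp(logits - logits.max())  # subtract max for numerical stability
    target = n_tokens / n_experts
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)             # each token sums to 1
        P = P * (target / P.sum(axis=0, keepdims=True))  # each expert gets equal mass
    return P

# Hypothetical usage: route each token to its highest-scoring expert.
rng = np.random.default_rng(0)
balanced = sinkhorn_balance(rng.standard_normal((64, 4)))
assignment = balanced.argmax(axis=1)
```

Routing on the balanced matrix instead of the raw logits is what discourages every token from piling onto the same expert.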
⭐ Progressive Distillation for Fast Sampling of Diffusion Models Up to 2000x inference-time speedup of diffusion models, which are starting to replace GANs in a lot of papers. Basically, instead of iteratively sampling and refining for 1000s of iterations, you can keep distilling a model so that it works even when sampling for only a few iterations.
Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction You can get away with storing the activations used to backprop through nonlinearities in fewer bits. Kind of a known result, but maybe sorta does better than the current best (ActNN from Berkeley). The method is clever and optimal in an MSE sense: you can just solve for the best piecewise-constant approximation of a given activation function's derivative with dynamic programming. Also, Appendix B has breakdowns of how much activation memory and how much nonlinearity activation memory different models require per sample (the nonlinearity part is about 25% of the total activation memory). Maybe a useful reference more generally. One point this raises that I hadn't really thought about: ReLU, leaky ReLU, and PReLU can use less activation memory (1 or 2 bits) than other nonlinearities.
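To make the ReLU point concrete: the backward pass only needs to know whether each input was positive, so one bit per element is enough. A minimal NumPy sketch of that idea (my own illustration, not the paper's implementation; the function names are made up):

```python
import numpy as np

def relu_forward(x):
    """ReLU that saves only a packed 1-bit mask for the backward pass."""
    mask = x > 0
    packed = np.packbits(mask.ravel())  # 8 mask entries per stored byte
    return np.where(mask, x, 0.0), (packed, x.shape)

def relu_backward(grad_out, saved):
    """Backprop through ReLU using only the packed sign mask."""
    packed, shape = saved
    n = int(np.prod(shape))
    mask = np.unpackbits(packed, count=n).reshape(shape).astype(bool)
    return grad_out * mask
```

Compared with keeping the full float32 input around, this stores 1/32 as much for the nonlinearity.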
⭐ Datamodels: Predicting Predictions from Training Data. The Madry lab project that motivated FFCV. They trained 100k imagenet models and a few million CIFAR models, and have a dump of the results here. They use this data for, e.g., "identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space"
Flashlight: Enabling Innovation in Tools for Machine Learning A bunch of FAIR people made a pytorch alternative designed to be hackable for systems researchers. 27kloc and 60 ops total. (pytorch and tf are both over 1M loc and thousands of ops).
Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity The dude at Rice who's trying to train neural nets on CPUs with LSH wrote another paper. Eval is entirely with MLPs on weird datasets where the runtime is dominated by the many-class output layer, which is where the LSH stuff actually works.
DynaMixer: A Vision MLP Architecture with Dynamic Mixing. Numbers in tables look good, but it looks super memory-bandwidth bound. For N spatial tokens (N = H * W), they first project down the channel axis to get an Nxd matrix, then flatten it into a length-Nd vector and multiply by an Nd x N^2 matrix to get the entries of an NxN attention matrix A. Then the output is AX, where X is the NxD matrix of tokens. Except they do this multi-headed for different subsets of D and concat.
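Here's the mixing op as I understand it, as a shape-level NumPy sketch. Assumptions on my part: the mixing weight has shape (N*d, N*N), which is what the shapes require to end up with an N x N matrix; any normalization of A (e.g. a softmax) is omitted; and `W_down` / `W_mix` are names I made up.

```python
import numpy as np

def dynamic_mix(X, W_down, W_mix, n_heads):
    """Input-dependent token mixing, one mixing matrix per channel group.

    X:      (N, D) tokens
    W_down: (n_heads, D // n_heads, d) per-head channel projection
    W_mix:  (n_heads, N * d, N * N) per-head weights that generate A
    Illustrative sketch; normalization of A is left out.
    """
    N, D = X.shape
    head_dim = D // n_heads
    outputs = []
    for h in range(n_heads):
        Xh = X[:, h * head_dim:(h + 1) * head_dim]      # (N, head_dim)
        low = Xh @ W_down[h]                            # (N, d)
        A = (low.reshape(-1) @ W_mix[h]).reshape(N, N)  # (N, N), depends on the input
        outputs.append(A @ Xh)                          # mix tokens within this head
    return np.concatenate(outputs, axis=1)              # (N, D)
```

Note that each entry of `W_mix` is touched once per sample, which fits the impression that this is bound by memory bandwidth rather than compute.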
O-ViT: Orthogonal Vision Transformer Forces self-attention weight matrices to be orthogonal, but doesn't clearly help, even on their small-ish eval tasks.
You Only Cut Once: Boosting Data Augmentation with a Single Cut Split the image in half horizontally or vertically and apply different augmentations to each half. Seems to usually help on (ImageNet, ResNet-50) when used in conjunction with autoaug, mixup, and a few other schemes. Surprised that it helps even with hflip. Simple to implement and might be worth trying.
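A minimal NumPy sketch of the cut-then-augment idea, assuming an (H, W, C) image and an augmentation that preserves the shape of whatever piece it is given. The names `yoco` and `random_hflip`, and the 50/50 choice of cut direction, are my own assumptions rather than the paper's code.

```python
import numpy as np

def random_hflip(img, rng):
    """Example shape-preserving augmentation: flip left-right half the time."""
    return img[:, ::-1] if rng.random() < 0.5 else img

def yoco(image, augment, rng):
    """Cut the image in half, augment each half independently, and rejoin."""
    h, w = image.shape[:2]
    if rng.random() < 0.5:
        # Cut across the height: top and bottom halves.
        pieces = [image[: h // 2], image[h // 2:]]
        axis = 0
    else:
        # Cut across the width: left and right halves.
        pieces = [image[:, : w // 2], image[:, w // 2:]]
        axis = 1
    return np.concatenate([augment(p, rng) for p in pieces], axis=axis)
```

Because each half draws its own random augmentation, even something as simple as hflip produces images that a whole-image flip never would.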
Möbius Convolutions for Spherical CNNs A Taco-Cohen-style paper about baking in fancy invariances into CNNs with math. Got way better accuracy on certain tasks requiring 3d (or 2d?) rotation invariance. But seemingly no runtime results.