2021-12-18: Data-free Knowledge Distillation, MagNets
A Static Analyzer for Detecting Tensor Shape Errors in Deep Neural Network Training Code
Exactly what it sounds like. Might make for a nice IDE plugin or linting tool.
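To make the "linting tool" idea concrete, here's a toy sketch of the kind of check involved: walk a chain of fully-connected layers and flag any layer whose expected input width doesn't match what the previous layer emits. This is not the paper's analyzer; `check_linear_chain` and its `(in_features, out_features)` tuple format are made up for illustration.

```python
from typing import List, Tuple


def check_linear_chain(input_dim: int, layers: List[Tuple[int, int]]) -> List[str]:
    """Report shape mismatches in a chain of (in_features, out_features) layers.

    Hypothetical helper: a real analyzer would work on actual training code,
    not on a hand-written list of layer sizes.
    """
    errors = []
    current = input_dim  # feature width flowing into the next layer
    for i, (in_features, out_features) in enumerate(layers):
        if in_features != current:
            errors.append(
                f"layer {i}: expects {in_features} input features, got {current}"
            )
        # Keep propagating the declared output width so later layers are still checked.
        current = out_features
    return errors


if __name__ == "__main__":
    # Second layer was declared with 64 inputs, but the first layer emits 128.
    for message in check_linear_chain(784, [(784, 128), (64, 10)]):
        print(message)
```

The point is just that this class of bug can be caught without running a single training step.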
AdaViT: Adaptive Tokens for Efficient Vision Transformer
Intelligent token dropping that apparently "improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3% accuracy drop", which might actually be a decent win.
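For intuition about where the throughput comes from, here's a minimal sketch of halting-style token dropping: each token accumulates a halting score across layers and is removed once that score crosses a threshold, so later layers process fewer tokens. The function name, the threshold value, and the exact halting rule are my assumptions, not necessarily what the paper does.

```python
def drop_halted_tokens(tokens, cumulative_halt, layer_halt, threshold=0.99):
    """Return the tokens (and their updated halting scores) that survive this layer.

    tokens          -- list of token representations (any objects)
    cumulative_halt -- halting score accumulated by each token so far
    layer_halt      -- halting score each token receives at this layer
    threshold       -- assumed cutoff; tokens at or above it are dropped
    """
    kept_tokens, kept_halt = [], []
    for token, so_far, increment in zip(tokens, cumulative_halt, layer_halt):
        updated = so_far + increment
        if updated < threshold:
            kept_tokens.append(token)
            kept_halt.append(updated)
    return kept_tokens, kept_halt


if __name__ == "__main__":
    tokens = ["cls", "sky", "dog"]
    kept, scores = drop_halted_tokens(tokens, [0.2, 0.9, 0.5], [0.3, 0.2, 0.1])
    print(kept)  # "sky" has crossed the threshold and is no longer processed
```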
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
"...the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models." Distillation tends to be a pretty big accuracy lift, so my takeaway from this is that it was apparently hard to make a combined model not completely suck. Kind of defies the intuition that feeding in 2+ modalities should help with symbol grounding / improve results, but maybe that doesn’t happen when you’re limited by model capacity.
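For reference, the generic soft-label distillation loss looks roughly like the sketch below: soften both teacher and student logits with a temperature, then penalize the KL divergence between them. This is the textbook formulation, not necessarily the exact loss in this paper, and it doesn't cover the gradient masking part. The function names and the default temperature are my own choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by temperature**2, the usual convention for keeping gradient
    magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


if __name__ == "__main__":
    print(distillation_loss([1.0, 0.5, -1.0], [2.0, 0.0, -2.0]))
```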
⭐ Up to 100x Faster Data-free Knowledge Distillation
"Experiments over CIFAR, NYUv2, and ImageNet demonstrate that the proposed FastDFKD achieves 10× and even 100× acceleration while preserving performances on par with state of the art." Doesn’t yield the accuracy lift of regular distillation, but does help train models without any access to the raw data.
Magnifying Networks for Images with Billions of Pixels
"a MagNet processes a downsampled version of an image, and without supervision learns how to identify areas that may carry value to the task at hand, upsamples them, and recursively repeats this process on each of the extracted patches." They only test on spatially huge datasets, but there might be promising ideas here. Plus I just find that attempting to tackle an extreme version of a problem is often a great forcing function for looking at it in a new way.
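The recursive zoom is easy to sketch in a few lines. The version below looks at a coarse 2x2 average-pooled view of the current region, picks the most interesting cell, and recurses into that cell at full resolution. Big caveats: the paper learns what to zoom into and can follow several patches, whereas this stand-in just follows the single brightest cell, and `coarse_view` / `magnify` are names I made up. It also assumes a square image whose side is a power of two.

```python
def coarse_view(img, top, left, size, grid=2):
    """Average-pool the size x size region at (top, left) down to grid x grid."""
    cell = size // grid
    return [
        [
            sum(
                img[top + r * cell + i][left + c * cell + j]
                for i in range(cell)
                for j in range(cell)
            ) / (cell * cell)
            for c in range(grid)
        ]
        for r in range(grid)
    ]


def magnify(img, top=0, left=0, size=None, min_size=1):
    """Recursively zoom into the brightest quadrant; return (top, left, size)."""
    if size is None:
        size = len(img)
    if size <= min_size:
        return (top, left, size)
    view = coarse_view(img, top, left, size)
    cell = size // 2
    # Brightest-cell selection is a hypothetical stand-in for a learned module.
    r, c = max(
        ((r, c) for r in range(2) for c in range(2)),
        key=lambda rc: view[rc[0]][rc[1]],
    )
    return magnify(img, top + r * cell, left + c * cell, cell, min_size)


if __name__ == "__main__":
    image = [[0.0] * 8 for _ in range(8)]
    image[5][6] = 1.0  # one "interesting" pixel
    print(magnify(image))
```

Each level only ever touches a 2x2 summary plus one quadrant, which is what makes the approach plausible for images far too large to process whole.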
Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction
Speeds up "super-resolution, image inpainting, and compressed sensing". Nice because it's just a different initialization, so should compose well with a lot of other methods.
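As I understand it, the "different initialization" amounts to this: instead of starting the reverse process from pure noise at the final timestep, take a cheap initial estimate of the answer, forward-diffuse it to some intermediate timestep, and run the reverse process only from there. A minimal sketch of that noising step, assuming a standard DDPM-style forward process; `forward_diffuse` and `alpha_bar_t0` (the cumulative signal fraction at the chosen start step) are my names, not the paper's.

```python
import math
import random


def forward_diffuse(x0_estimate, alpha_bar_t0, rng):
    """Sample a noised copy of x0_estimate at the intermediate start step.

    Uses x_t0 = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise, with
    independent standard normal noise per element.
    """
    signal_scale = math.sqrt(alpha_bar_t0)
    noise_scale = math.sqrt(1.0 - alpha_bar_t0)
    return [
        signal_scale * x + noise_scale * rng.gauss(0.0, 1.0)
        for x in x0_estimate
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    rough_guess = [0.1, 0.4, 0.9]  # e.g. a cheaply upsampled image, flattened
    x_start = forward_diffuse(rough_guess, alpha_bar_t0=0.7, rng=rng)
    print(x_start)  # hand this to the sampler in place of pure noise
```

Since nothing about the sampler itself changes, any faster sampler should be usable for the remaining steps.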
Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition
An MoE applications paper from Microsoft. Haven't scrutinized it much, but seems to confirm that MoE is a thing people are starting to use.
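In case "sparsely-gated" is unfamiliar: a gate scores every expert, only the top k are actually run, and their outputs are mixed using softmax weights over just those k scores. Here's a bare-bones sketch of that routing. It's the generic recipe rather than this paper's exact setup, and it omits the load-balancing terms real systems tend to add; `top_k_gate` and `moe_forward` are illustrative names.

```python
import math


def top_k_gate(gate_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    ranked = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )
    chosen = ranked[:k]
    peak = max(gate_logits[i] for i in chosen)
    exps = {i: math.exp(gate_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    weights = top_k_gate(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in weights.items())


if __name__ == "__main__":
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 4 * x, lambda x: 0 * x]
    # Experts 1 and 2 tie for the top score; experts 0 and 3 are never evaluated.
    print(moe_forward(3.0, experts, [0.0, 5.0, 5.0, -1.0]))
```

Compute per input scales with k rather than with the total number of experts, which is the whole appeal.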