Davis Summarizes Papers


2021-12-18: Data-free Knowledge Distillation, MagNets

Davis Blalock
May 3

A Static Analyzer for Detecting Tensor Shape Errors in Deep Neural Network Training Code. Exactly what it sounds like. Might make for a nice IDE plugin or linting tool.
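
To make the idea concrete, here is a minimal sketch of checking shapes without running any tensor math. It is not the paper's analyzer: `ShapeError`, `linear_out_shape`, and `check_mlp` are hypothetical names, and the only layer type modeled is a dense layer acting on the last axis.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


class ShapeError(Exception):
    """Raised when a layer's expected input size does not match the incoming shape."""


def linear_out_shape(in_shape: Shape, in_features: int, out_features: int) -> Shape:
    """Output shape of a dense layer applied to the last axis of the input."""
    if not in_shape or in_shape[-1] != in_features:
        raise ShapeError(f"expected last dim {in_features}, got shape {in_shape}")
    return in_shape[:-1] + (out_features,)


def check_mlp(input_shape: Shape, layers: List[Tuple[int, int]]) -> Shape:
    """Propagate a shape through a stack of (in_features, out_features) layers."""
    shape = input_shape
    for i, (fin, fout) in enumerate(layers):
        try:
            shape = linear_out_shape(shape, fin, fout)
        except ShapeError as err:
            # Report which layer failed, the way a linter would.
            raise ShapeError(f"layer {i}: {err}") from None
    return shape
```

Calling `check_mlp((32, 784), [(784, 128), (64, 10)])` raises a `ShapeError` that names layer 1, before any training step runs.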

AdaViT: Adaptive Tokens for Efficient Vision Transformer. Intelligent token dropping that apparently "improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3% accuracy drop", which might actually be a decent win.
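
As a rough illustration of what token dropping means, here is a sketch that keeps only the highest-scoring tokens. The fixed `keep_ratio` and the externally supplied `scores` are assumptions made for the example, and `drop_tokens` is a hypothetical helper; how AdaViT actually decides which tokens to drop is not covered by the summary above.

```python
from typing import List, Sequence


def drop_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep_ratio: float,
) -> List[Sequence[float]]:
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("need exactly one score per token")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the n_keep highest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Re-sort by position so later layers see tokens in sequence order.
    return [tokens[i] for i in sorted(top)]
```

Later layers then process fewer tokens, which is where the throughput gain would come from.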

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text. "...the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models." Distillation tends to be a pretty big accuracy lift, so my takeaway from this is that it was apparently hard to make a combined model not completely suck. Kind of defies the intuition that feeding in 2+ modalities should help with symbol grounding / improve results, but maybe that benefit doesn’t show up when you’re limited by model capacity.

⭐ Up to 100x Faster Data-free Knowledge Distillation. "Experiments over CIFAR, NYUv2, and ImageNet demonstrate that the proposed FastDFKD achieves 10× and even 100× acceleration while preserving performances on par with state of the art." Doesn’t yield the accuracy lift of regular distillation, but does help train models without any access to the raw data.
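
For reference, here is a sketch of the usual temperature-softened distillation objective: the KL divergence from the teacher's softened distribution to the student's, scaled by the squared temperature. This is the standard loss, not FastDFKD's contribution. The loss only needs teacher and student logits, so it does not care whether the inputs were real or synthesized. `log_softmax` and `distillation_loss` are names chosen for this example, and the default temperature is arbitrary.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, times temperature squared."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2
```

The loss is zero when the student matches the teacher exactly and positive otherwise.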

Magnifying Networks for Images with Billions of Pixels. "a MagNet processes a downsampled version of an image, and without supervision learns how to identify areas that may carry value to the task at hand, upsamples them, and recursively repeats this process on each of the extracted patches." They only test on spatially huge datasets, but there might be promising ideas here. Plus I just find that attempting to tackle an extreme version of a problem is often a great forcing function for looking at it in a new way.
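
The recursive look-coarse-then-zoom loop in that quote can be sketched in a few lines. This is a toy under heavy assumptions, not the paper's model: the learned region selector is replaced by a hand-written rule (pick the brightest quadrant), only one patch is followed per level, and the image is assumed to be a square list of lists with a power-of-two side. `region_mean` and `zoom` are hypothetical names.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]


def region_mean(img: Image, top: int, left: int, size: int) -> float:
    """Mean pixel value of a square region, i.e. one pixel of a downsampled view."""
    total = sum(img[r][c] for r in range(top, top + size) for c in range(left, left + size))
    return total / (size * size)


def zoom(
    img: Image,
    top: int = 0,
    left: int = 0,
    size: Optional[int] = None,
    min_size: int = 1,
) -> Tuple[int, int, int]:
    """Recursively zoom into the most promising quadrant; return (top, left, size)."""
    if size is None:
        size = len(img)
    if size <= min_size:
        return (top, left, size)
    half = size // 2
    # A 2x2 downsampled view of the current region: one mean per quadrant.
    quadrants = [(top + dt, left + dl) for dt in (0, half) for dl in (0, half)]
    # Stand-in for the learned selector: pick the brightest quadrant.
    best_top, best_left = max(quadrants, key=lambda q: region_mean(img, q[0], q[1], half))
    # Recurse into that quadrant at full resolution.
    return zoom(img, best_top, best_left, half, min_size)
```

The point of the structure is that each level only ever looks at a coarse summary of its region, so the full-resolution image is never processed in one pass.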

Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction. Speeds up "super-resolution, image inpainting, and compressed sensing". Nice because it's just a different initialization, so should compose well with a lot of other methods.
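
Here is a sketch of what "a different initialization" could look like, assuming it means starting the reverse process from a noised copy of a rough estimate at some intermediate step `t0` rather than from pure Gaussian noise at the final step. The function name `noised_start`, the list-of-floats signal, and the noise schedule are illustrative assumptions; the formula is the standard closed-form forward diffusion.

```python
import math
import random
from typing import List, Sequence


def noised_start(
    estimate: Sequence[float],
    t0: int,
    betas: Sequence[float],
    rng: random.Random,
) -> List[float]:
    """Forward-diffuse a rough estimate to step t0 in closed form.

    x_t0 = sqrt(a) * x_0 + sqrt(1 - a) * noise, where a is the product of
    (1 - beta_s) over the first t0 steps of the schedule.
    """
    if not 0 <= t0 <= len(betas):
        raise ValueError("t0 must be between 0 and len(betas)")
    a = math.prod(1.0 - b for b in betas[:t0])
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in estimate]
```

With `t0 = 0` the estimate comes back unchanged. With `t0` at the end of a long schedule it is close to pure noise. The reverse sampler itself is untouched, which is why this kind of change should compose with other speedups.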

Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition. An MoE applications paper from Microsoft. Haven't scrutinized it much, but seems to confirm that MoE is a thing people are starting to use.
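
For anyone who hasn't looked at sparsely-gated MoE, here is a minimal sketch of top-k gating on scalars. It is a generic illustration rather than this paper's architecture: `top_k_gate` and `moe_forward` are hypothetical names, and real routers compute the gate logits from the input and usually add noise and load-balancing terms.

```python
import math
from typing import Callable, Dict, Sequence


def top_k_gate(gate_logits: Sequence[float], k: int) -> Dict[int, float]:
    """Softmax over only the k largest gate logits; every other expert gets no weight."""
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(
    x: float,
    experts: Sequence[Callable[[float], float]],
    gate_logits: Sequence[float],
    k: int = 2,
) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

Because only `k` experts run per input, the parameter count can grow with the number of experts while per-example compute stays roughly constant.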

© 2022 Davis Blalock