2022-1-4: Two sparsities, Vision reservoir

May 03, 2022

SPViT: Enabling Faster Vision Transformers via Soft Token Pruning. Token pruning for transformers. Results didn’t sound especially compelling but who knows.
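For a sense of what token pruning means mechanically, here is a minimal sketch of the simplest hard variant: score each token, then drop the low scorers before the next layer. This is not SPViT's soft pruning method. The function name `prune_tokens`, the scores, and the keep ratio are all made up for illustration.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the class token (index 0) plus the highest-scoring patch tokens.

    tokens: list of token vectors; tokens[0] is assumed to be the class token.
    scores: one importance score per token (hypothetical, e.g. attention mass).
    keep_ratio: fraction of patch tokens to keep.
    """
    patch_indices = range(1, len(tokens))
    num_keep = max(1, round(keep_ratio * len(patch_indices)))
    top = sorted(patch_indices, key=lambda i: scores[i], reverse=True)[:num_keep]
    kept = [0] + sorted(top)  # preserve the original token order
    return [tokens[i] for i in kept], kept


# Toy input: one class token followed by four patch tokens.
tokens = [[0.0], [1.0], [2.0], [3.0], [4.0]]
scores = [0.0, 0.1, 0.7, 0.2, 0.9]
pruned, kept = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept)
```

Later layers then run on fewer tokens, which is where the speedup would come from.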

Resource-Efficient Deep Learning: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques. Just what it sounds like.

Training Quantized Deep Neural Networks via Cooperative Coevolution. They claim they got 4-bit CIFAR-10 training working at iso accuracy, but the paper is hard to read and this seems to be the product of just throwing a ton of compute at trying different quantization configurations. There might be a cool core here with bad packaging.

Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks. They claim up to 100x speedup on ResNet-50 and MobileNet by exploiting both weight and activation sparsity on an FPGA. They have good figures and there might be good ideas here. I'm skeptical of it though because 1) they have no comparison to existing work and 2) I've never seen Numenta produce anything state-of-the-art.
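To see why combining the two sparsities multiplies the savings, here is a back-of-the-envelope sketch. It counts the multiply-accumulates left in one dot product when a kernel can skip every position where either the weight or the activation is zero. The 10% densities are hypothetical and not taken from the paper, and the count assumes the zeros are independent and that skipping is free, which real hardware won't give you.

```python
import random


def count_macs(weights, activations):
    """Count positions where both the weight and the activation are nonzero."""
    return sum(1 for w, a in zip(weights, activations) if w != 0 and a != 0)


def random_sparse_vector(n, density, rng):
    """Length-n vector whose entries are nonzero with probability `density`."""
    return [rng.uniform(0.1, 1.0) if rng.random() < density else 0.0
            for _ in range(n)]


rng = random.Random(0)
n = 100_000
weight_density = 0.1      # hypothetical: 90% of weights pruned
activation_density = 0.1  # hypothetical: 90% of activations are zero

w = random_sparse_vector(n, weight_density, rng)
a = random_sparse_vector(n, activation_density, rng)

dense_macs = n
sparse_sparse_macs = count_macs(w, a)
fraction = sparse_sparse_macs / dense_macs
print(f"fraction of dense work remaining: {fraction:.4f}")
```

The remaining fraction should land near 0.1 * 0.1 = 0.01, i.e. roughly 100x fewer multiply-accumulates in the ideal case. Whether that turns into a 100x wall-clock speedup is a separate question.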

Fine-Tuning Transformers: Vocabulary Transfer. Apparently using corpus-specific tokenization for fine-tuning can help transfer learning, at least if you initialize things right. Seems fairly actionable, though I didn't look at it closely.
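As a sketch of what "initialize things right" could look like: one plausible heuristic is to give each new token the average of the old embeddings of the pieces the old tokenizer would split it into. I haven't checked that this matches the paper's exact procedure. `greedy_tokenize` and `init_new_embeddings` are hypothetical helpers, and the toy vocabulary is made up.

```python
def greedy_tokenize(word, vocab):
    """Toy longest-match-first tokenizer over a set of known pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            i += 1  # no known piece starts here; skip one character
    return pieces


def init_new_embeddings(new_vocab, old_embeddings):
    """Copy embeddings for shared tokens; average old pieces for new tokens."""
    dim = len(next(iter(old_embeddings.values())))
    new_embeddings = {}
    for token in new_vocab:
        if token in old_embeddings:
            new_embeddings[token] = list(old_embeddings[token])
            continue
        pieces = greedy_tokenize(token, old_embeddings)
        if not pieces:
            new_embeddings[token] = [0.0] * dim  # fallback: nothing to reuse
            continue
        new_embeddings[token] = [
            sum(old_embeddings[p][k] for p in pieces) / len(pieces)
            for k in range(dim)
        ]
    return new_embeddings


# Made-up 2-d embeddings for a made-up old vocabulary.
old_embeddings = {"token": [1.0, 0.0], "ize": [0.0, 1.0], "r": [1.0, 1.0]}
new_embeddings = init_new_embeddings(["token", "tokenize", "tokenizer"],
                                     old_embeddings)
print(new_embeddings["tokenize"])
```

Here "tokenize" splits into "token" + "ize" under the old vocabulary, so it starts at the midpoint of those two embeddings rather than at random.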

Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies. Looks like Luke Metz + Jascha figured out a better way to do gradient-based hparam tuning / meta-learning in the same vein as Teaching with Commentaries.
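For context, here is a minimal sketch of the plain antithetic evolution strategies estimator that this line of work builds on. It is not PES itself and has none of the persistent bookkeeping across unrolls that the title refers to. The quadratic `toy_loss` is a made-up stand-in for "loss after training with this hyperparameter", and `es_gradient` is a name I picked.

```python
import random


def es_gradient(f, theta, sigma, num_pairs, rng):
    """Antithetic ES estimate of df/dtheta for a scalar theta.

    Averages eps * (f(theta + eps) - f(theta - eps)) / (2 * sigma^2)
    over Gaussian perturbations eps ~ N(0, sigma^2).
    """
    total = 0.0
    for _ in range(num_pairs):
        eps = rng.gauss(0.0, sigma)
        total += eps * (f(theta + eps) - f(theta - eps))
    return total / (2.0 * sigma ** 2 * num_pairs)


def toy_loss(hparam):
    """Hypothetical stand-in for a loss as a function of one hyperparameter."""
    return (hparam - 3.0) ** 2


rng = random.Random(0)
estimate = es_gradient(toy_loss, theta=1.0, sigma=0.1, num_pairs=10_000, rng=rng)
print(f"ES estimate: {estimate:.3f} (true gradient is -4)")
```

The estimator only needs function evaluations, so it works even when backpropagating through the whole unrolled training run is impractical.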

ViR: the Vision Reservoir. They don't go bigger than CIFAR100, but big lifts over ViT without pretraining by just having a big reservoir of token representations (or something like that). ViT without pretraining kinda sucks, but there might be an interesting lesson here, since I wouldn't expect the reservoir aspect to naturally fix ViT's lack of inductive bias (and therefore poor performance without pretraining). Basically, small but decent prob that this is introducing really interesting findings regarding network design.

Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow. tl;dr: deep learning software is buggy.

Davis Summarizes Papers
