2021-9-19 arXiv roundup - OMPQ, Don't pretrain?, EfficientBERT, Primer
OMPQ: Orthogonal Mixed Precision Quantization
They figure out how many bits to use for different layers in <9 seconds by using a proxy objective. In the camp of "read this if and only if you care about quantization."
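To make "pick bits per layer with a proxy objective" concrete, here is a toy sketch. It is not the paper's algorithm: the importance scores, the linear score-times-bits objective, and the `allocate_bits` helper are all made up for illustration, and the brute-force search only works for a handful of layers.

```python
from itertools import product

def allocate_bits(param_counts, importance, choices=(2, 4, 8), budget_bits=0):
    """Brute-force per-layer bit-widths that maximize a proxy score under a size budget.

    param_counts: number of weights in each layer.
    importance:   hypothetical per-layer proxy scores (higher = more sensitive).
    budget_bits:  maximum total model size in bits.
    """
    best_score, best_bits = None, None
    for bits in product(choices, repeat=len(param_counts)):
        size = sum(n * b for n, b in zip(param_counts, bits))
        if size > budget_bits:
            continue  # this assignment does not fit in the budget
        # Assumed proxy: reward giving more bits to more important layers.
        score = sum(w * b for w, b in zip(importance, bits))
        if best_score is None or score > best_score:
            best_score, best_bits = score, bits
    return best_bits

param_counts = [1000, 4000, 2000]   # made-up layer sizes
importance = [3.0, 1.0, 2.0]        # made-up proxy scores
budget = 4 * sum(param_counts)      # an average of 4 bits per weight
bits = allocate_bits(param_counts, importance, budget_bits=budget)
print(bits)
```

The point is just that once the per-layer scores are cheap to compute, the search itself needs no training or evaluation runs, which is presumably where the seconds-scale runtime comes from.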
Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative
They beat fine-tuning on some tasks using multi-task learning + meta-learning. Kind of makes sense intuitively to train directly on what you care about to at least some extent, but the results didn't seem too conclusive, and their method is more complicated.
EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation
Doomed to be overshadowed by:
Primer: Searching for Efficient Transformers for Language Modeling
Basically just saying to use ReLU^2 instead of ReLU in the feed-forward layers and to add a depthwise convolution after each of the Q, K, and V projections in self-attention.
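For reference, ReLU^2 is just max(x, 0) squared, applied elementwise. A minimal plain-Python sketch (the function names are mine, not the paper's):

```python
def relu(x):
    """Elementwise max(v, 0) over a list of floats."""
    return [max(v, 0.0) for v in x]

def squared_relu(x):
    """Elementwise max(v, 0) ** 2 over a list of floats."""
    return [max(v, 0.0) ** 2 for v in x]

xs = [-2.0, -0.5, 0.0, 0.5, 3.0]
print(relu(xs))          # [0.0, 0.0, 0.0, 0.5, 3.0]
print(squared_relu(xs))  # [0.0, 0.0, 0.0, 0.25, 9.0]
```

Compared to plain ReLU, small positive inputs get shrunk and large ones get amplified, while negative inputs still map to zero.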