2022-2-12: Generating training data, EfficientNet-X, Editing factual knowledge
Locating and Editing Factual Knowledge in GPT. They edit factual knowledge in GPT-J, meaning they, e.g., get the model to generate sentences as if the Eiffel Tower were in Rome rather than Paris. It's find-and-replace for inputs, like the Madry paper, but done at the token level instead. The rank-1 update of one layer's weight matrix is nice, but they have to optimize the "replacement" embedding via backprop, so I doubt there's much speedup to be had. Twitter thread summary.
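For intuition, here's a minimal sketch of what a rank-1 "write" to one layer looks like. This is my own toy version, not the paper's exact update (which also accounts for the statistics of the keys); it just shows how a single outer product can force one input-output mapping while perturbing the weights as little as possible.

```python
# Toy rank-1 edit: given a "key" activation k for the subject (e.g., "Eiffel Tower")
# and a target "value" v (optimized via backprop so the model emits the new fact),
# patch W so that W_new @ k == v with a minimum-norm rank-1 change.
import numpy as np

def rank_one_edit(W: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    residual = v - W @ k                      # what the current layer gets wrong for this key
    delta = np.outer(residual, k) / (k @ k)   # rank-1 correction
    return W + delta

# quick check that the edited weight maps k to v
W, k, v = np.random.randn(8, 8), np.random.randn(8), np.random.randn(8)
assert np.allclose(rank_one_edit(W, k, v) @ k, v)
```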
Understanding Rare Spurious Correlations in Neural Networks. Spurious correlation is stuff like only recognizing croquet balls when the background is grass. Turns out just a few examples are enough to make the network learn these, and they're hard to unlearn even if you try.
Universal Hopfield Networks: A General Framework for Single-Shot Associative Memory Models. Elegant unification of different associative memory formulations. I've been thinking about associative memories as a supplement to backprop / alternative to explicit retrieval and found this helpful in understanding the current landscape.
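As I understand the framework, retrieval always factors into similarity, then separation, then projection, and different choices of those three pieces recover classical Hopfield nets, modern/continuous Hopfield nets (softmax separation, i.e., attention), sparse distributed memories, and so on. A rough numpy sketch of that decomposition, with my own toy demo:

```python
import numpy as np

def softmax(x, beta=1.0):
    z = np.exp(beta * (x - x.max()))
    return z / z.sum()

def retrieve(keys, values, query, sep=softmax):
    scores = keys @ query          # similarity: compare query against stored keys
    weights = sep(scores)          # separation: sharpen toward the best match
    return values.T @ weights      # projection: read out the stored values

# auto-associative demo: query with a corrupted copy of pattern 2,
# 'recalled' should land close to keys[2]
keys = np.random.randn(5, 16)
query = keys[2] + 0.1 * np.random.randn(16)
recalled = retrieve(keys, keys, query, lambda s: softmax(s, beta=4.0))
```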
⭐ Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. Claims to perform as well with zero labeled samples as few-shot learning does with 32 samples. The approach is to have a generative model output class-conditioned sequences (with well-chosen prompts) and fine-tune a BERT on them. Didn't scrutinize this in detail, but techniques like this becoming standard practice could change the economics around pre-trained models, allowing them to also provide labeled data.
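A rough sketch of the recipe with toy prompts of my own (the paper engineers its prompts carefully per task and adds filtering and label smoothing on top); any decent generative LM would do in place of GPT-2 here:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# hypothetical class-conditioned prompts for a sentiment task
label_prompts = {
    "positive": "The movie review in positive sentiment is: \"",
    "negative": "The movie review in negative sentiment is: \"",
}

synthetic_data = []
for label, prompt in label_prompts.items():
    outputs = generator(prompt, max_new_tokens=40, num_return_sequences=8,
                        do_sample=True, temperature=1.0)
    for out in outputs:
        text = out["generated_text"][len(prompt):]   # strip the prompt prefix
        synthetic_data.append((text, label))

# synthetic_data now plays the role of a labeled training set
# for fine-tuning a BERT-style classifier.
```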
Learnings from Federated Learning in the Real World. A bunch of Alexa people describe problems they've faced and the algorithmic / pipeline choices that have helped. I no longer follow the federated learning literature, but papers like this describing real-world deployments are particularly rare and valuable resources.
Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning. Based on Algorithm 1, this seems to just be SAM but with a convex combination of the gradient at the current params and the gradient at the perturbed params (SAM is the special case that only uses the latter). They report it helping slightly more than SAM on ImageNet, although SAM didn't seem to help as much as I would expect. There are also results on CIFAR-{10,100} with different data augmentations. Seems like an easy-to-implement SAM variation that might do even better.
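A minimal PyTorch sketch of the update as I read it (not the authors' code): take a small step along the normalized gradient, then descend along a convex combination of the gradient at the current weights and the gradient at the perturbed weights. Setting alpha=1 recovers SAM; rho and alpha are the method's hyperparameters.

```python
import torch

def perturbed_gradient_step(model, loss_fn, batch, opt, rho=0.05, alpha=0.8):
    x, y = batch

    # gradient at the current parameters
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]

    # perturb the weights by rho * g / ||g||
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)

    # gradient at the perturbed parameters
    opt.zero_grad()
    loss_fn(model(x), y).backward()

    # undo the perturbation, then step along the convex combination of both gradients
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)
            p.grad.mul_(alpha).add_((1 - alpha) * g)
    opt.step()
```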
Red Teaming Language Models with Language Models. This combined with the above training data generation paper makes me suspect we'll eventually end up in a weird situation where we're using ML to train and validate ML (more fully and directly than we currently do).
The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training. They frame this as a positive result, but it looks more like a negative one: 60% sparsity on a WideResNet-50 on ImageNet yielded ~0.5% lower accuracy. That's a large loss for very little sparsity.
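For reference, "random pruning" here just means fixing a random sparsity mask per layer before training and keeping it for the whole run. A minimal sketch of what that mask looks like:

```python
import torch

def random_prune_(weight: torch.Tensor, sparsity: float = 0.6) -> torch.Tensor:
    """Zero out a random `sparsity` fraction of entries in place; returns the mask,
    which is re-applied after every optimizer step to keep the layer sparse."""
    mask = (torch.rand_like(weight) >= sparsity).float()
    with torch.no_grad():
        weight.mul_(mask)
    return mask
```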
Block-NeRF: Scalable Large Scene Neural View Synthesis. They built cool 3D models of a neighborhood in San Francisco by training separate NeRF models for different blocks and querying the appropriate ones for each frame's position and viewpoint. The video looks really good.
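My rough mental model of the rendering loop, as a sketch rather than the paper's actual pipeline (the per-block NeRF callables and the inverse-distance blending below are my own simplification):

```python
import numpy as np

def render_frame(block_nerfs, block_origins, cam_pos, cam_dir, radius=100.0):
    """block_nerfs: dict id -> callable(pos, dir) -> HxWx3 image (hypothetical interface)."""
    dists = {i: np.linalg.norm(cam_pos - o) for i, o in block_origins.items()}
    nearby = [i for i, d in dists.items() if d < radius]      # only blocks near the camera
    weights = np.array([1.0 / (dists[i] + 1e-6) for i in nearby])
    weights /= weights.sum()                                   # blend by inverse distance
    images = [block_nerfs[i](cam_pos, cam_dir) for i in nearby]
    return sum(w * img for w, img in zip(weights, images))
```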
Data Scaling Laws in NMT: The Effect of Noise and Architecture. "We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data. Lastly, we find that using back-translated data instead of parallel data can significantly degrade the scaling exponent."
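For reference, the "scaling exponent" is the slope b when fitting loss ≈ a · D^(-b) to (dataset size, loss) pairs on a log-log scale; the numbers below are made up just to show the fit:

```python
import numpy as np

dataset_sizes = np.array([1e6, 2e6, 4e6, 8e6, 1.6e7])   # sentence pairs (made up)
losses = np.array([3.10, 2.85, 2.62, 2.41, 2.22])        # dev losses (made up)

slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(losses), 1)
print(f"scaling exponent b: {-slope:.3f}")  # worse data/architectures mostly shift a, not b
```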
⭐ Searching for Fast Model Families on Datacenter Accelerators. They propose a new EfficientNet-X that's apparently "up to 7X faster than recent RegNet and ResNeSt on TPUv3 and GPUv100." The design comes from NAS with a search space that's not totally oblivious to how hardware works. Although the fact that no one on the paper caught that "GPUv100" isn't a thing is a bad sign.