2021-1-16: Grokking, Semantic segmentation with {BERT embeddings, only image-level labels}
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets An OpenAI paper where they try to teach neural nets to solve problems that have exact algorithmic solutions. The Bayes error rate for these problems is zero, but the input-output relationships are not as smooth as in most traditional tasks, which makes them much harder to learn. What they find is that, long after the model has memorized the training set, it suddenly starts doing well on the validation set. They refer to this phenomenon as "grokking". Not that actionable, but interesting work that thinks more deeply about the nature of intelligence than the typical deep learning paper.
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance They have a procedure that apparently predicts OOD performance really well based on prediction confidence on unlabeled data.
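To make the idea concrete, here is a minimal sketch of one way a confidence-based estimate could work. It is an assumption about the general shape of such a procedure, not the paper's exact method: pick a confidence threshold on labeled in-distribution data so that the share of examples above the threshold matches the known accuracy, then report the share of unlabeled OOD examples above that same threshold as the predicted accuracy. The function names and the toy numbers are made up for illustration.

```python
def fit_confidence_threshold(source_confidences, source_correct):
    """Pick a threshold t so that the share of labeled source examples with
    confidence >= t is as close as possible to the source accuracy.

    Hypothetical helper: a sketch of the idea, not the paper's code.
    """
    n = len(source_confidences)
    accuracy = sum(source_correct) / n
    # float("inf") covers the degenerate case of zero source accuracy.
    candidates = sorted(set(source_confidences)) + [float("inf")]
    best_t, best_gap = None, float("inf")
    for t in candidates:
        share_above = sum(c >= t for c in source_confidences) / n
        gap = abs(share_above - accuracy)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t


def predict_accuracy(target_confidences, threshold):
    """Estimate accuracy on unlabeled data as the share of examples whose
    confidence clears the threshold fitted on labeled source data."""
    above = sum(c >= threshold for c in target_confidences)
    return above / len(target_confidences)


# Toy numbers, invented for illustration.
source_conf = [0.9, 0.8, 0.6, 0.55]   # max softmax probability per example
source_correct = [1, 1, 1, 0]         # 1 if the prediction was right
target_conf = [0.95, 0.7, 0.5, 0.4]   # unlabeled OOD examples

threshold = fit_confidence_threshold(source_conf, source_correct)
estimate = predict_accuracy(target_conf, threshold)
print(threshold, estimate)
```

The appeal of this kind of scheme is that the only labeled data it needs is the in-distribution validation set you already have.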
HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning This seems to be a demonstration that hypernets work for few-shot learning with vision tasks. The opportunity here is that you might be able to pretrain one giant model, but only use a tiny, task-specific model at inference. So you 1) still amortize the training cost across many downstream tasks, and 2) can justify even bigger pretrained models because you don't have to worry about their inference costs.
⭐ Language-driven Semantic Segmentation Given per-pixel image labels, they train the model to make each pixel's embedding look like a BERT embedding of the corresponding class name. This lets it generalize to new classes with no retraining and no new examples.
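A minimal sketch of what inference could look like under this setup, assuming you already have a per-pixel embedding from the image model and a text embedding for each class name. The `segment` function, the cosine-similarity scoring, and the 2-D toy vectors are all assumptions for illustration, not the paper's code. The point is that adding a class at test time only means adding one more text embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment(pixel_embeddings, class_embeddings):
    """Label each pixel with the class whose text embedding is most similar.

    pixel_embeddings: H x W grid (list of rows) of embedding vectors.
    class_embeddings: dict mapping class name -> text embedding vector.
    """
    names = list(class_embeddings)
    return [
        [max(names, key=lambda n: cosine(px, class_embeddings[n])) for px in row]
        for row in pixel_embeddings
    ]


# Toy 1 x 3 "image" with made-up 2-D embeddings.
pixels = [[[0.9, 0.1], [0.2, 0.8], [-0.7, 0.1]]]
classes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
before = segment(pixels, classes)

# A new class at test time is just one more text embedding; no retraining.
classes["grass"] = [-1.0, 0.0]
after = segment(pixels, classes)
print(before, after)
```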
⭐ Detecting Twenty-thousand Classes using Image-level Supervision They manage to train an object detector using only image-level class labels. They do this by training a detector directly and lying to it about the bboxes. Namely, they take the region proposer's largest box as the single ground truth bbox for the image and give it the image's label. If the image has multiple labels, they all get this same bbox. They also fix the weights of the final classification layer to the CLIP embeddings of the class names, so it works on an open vocabulary (i.e., new classes at test time).
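The bbox trick described above is simple enough to sketch in a few lines. This is an illustration under assumptions, not the paper's code: boxes are `(x1, y1, x2, y2)` tuples, and `pseudo_ground_truth` is a hypothetical helper name.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def pseudo_ground_truth(proposals, image_labels):
    """Pair the largest proposal with every image-level label.

    Returns a list of (box, label) targets for the detector's
    classification loss. Every label shares the same box.
    """
    if not proposals:
        return []
    largest = max(proposals, key=box_area)
    return [(largest, label) for label in image_labels]


# Toy proposals and labels, invented for illustration.
proposals = [(0, 0, 10, 10), (5, 5, 50, 40), (20, 20, 30, 30)]
targets = pseudo_ground_truth(proposals, ["dog", "frisbee"])
print(targets)
```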
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? Another self-supervised ResNet. Doesn't really look like it does better than DetCon (and doesn't compare to it or show clear training curves), but maybe it does. They have a more elaborate method for constructing the supervision signal, and there might be some insights there. 1000 epochs of training gets them 77.1% accuracy on ImageNet with a ResNet-50.