2021-12-4: Sparsity is Enough in Scaling Transformers, Sparse ImageNet transfer
dblalock.substack.com
Adaptive Optimization with Examplewise Gradients
They try to exploit per-sample gradient information (rather than the usual per-batch gradients averaged across samples) to improve Adam. It seems to be a negative result so far.
How Well Do Sparse Imagenet Models Transfer?
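To make the per-sample versus per-batch distinction in the Adaptive Optimization with Examplewise Gradients summary concrete, here is a minimal sketch on a toy problem. It is not the paper's method. The 1-D linear model, the squared-error loss, and every function name below are assumptions chosen for illustration. It only shows what extra information per-sample gradients carry: an optimizer that sees just the batch-averaged gradient can square that average, while per-sample gradients also give the average of the squares.

```python
# Hypothetical toy setup (not from the paper): model y_hat = w * x,
# per-example loss (w * x - y) ** 2, scalar parameter w.

def per_sample_grads(w, xs, ys):
    """Gradient of each example's loss with respect to w, one value per example."""
    return [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]


def batch_grad(w, xs, ys):
    """Gradient of the mean loss, i.e. the average of the per-sample gradients."""
    grads = per_sample_grads(w, xs, ys)
    return sum(grads) / len(grads)


def second_moment_stats(w, xs, ys):
    """Return (square of the mean gradient, mean of the squared gradients).

    The first quantity is all that can be computed from the batch-averaged
    gradient alone; the second needs the individual per-sample gradients.
    """
    grads = per_sample_grads(w, xs, ys)
    n = len(grads)
    mean_grad = sum(grads) / n
    square_of_mean = mean_grad ** 2
    mean_of_squares = sum(g * g for g in grads) / n
    return square_of_mean, mean_of_squares


if __name__ == "__main__":
    xs, ys, w = [1.0, 2.0, 3.0], [2.0, 4.0, 7.0], 1.0
    print("per-sample grads:", per_sample_grads(w, xs, ys))
    print("batch grad:", batch_grad(w, xs, ys))
    print("(square of mean, mean of squares):", second_moment_stats(w, xs, ys))
```

The gap between the two returned values is the variance of the gradient across the examples in the batch, which is the kind of signal that is lost once gradients are averaged.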