2024-4-28 arXiv roundup: data and scaling…

Apr 29, 2024

Besides getting to cover unusually interesting work, the upside of having a big backlog is that you can group your coverage thematically.

1 Comment

Aaron Scher

Apr 29, 2024

> While the results of this paper are scoped to image-text models, this makes me super curious whether this log-linear relationship holds for text and other models.

You may be interested in Kandpal et al., which this paper cites but does not thoroughly discuss. https://proceedings.mlr.press/v202/kandpal23a/kandpal23a.pdf

You may also be interested in https://arxiv.org/abs/2404.01413, regarding:

> Especially interesting is their demonstration that having just 2% of the data come from the non-AI-generated distribution can significantly (though not completely) mitigate the damage for a LLaMA 2 model.

Davis Summarizes Papers
