7 Comments

I like this analysis. The story in ImpossibleDistillation is a bit more complex than simple generate-and-filter. ImpossibleDistillation is combining information from four models:

1. The initial generation model L_LM

2. The keyword extraction model (KeyBERT)

3. The Natural Language Inference model (RoBERTa-Large fine-tuned on WANLI)

4. The model L_0 that is fine-tuned (T5-large)

Items 2 and 3 form part of the filter chain which includes additional encoded knowledge defining the task (summarization vs. paraphrase, compression, etc.). Viewed abstractly, this shows that we can use initial LMs in both the generation and filtering phases combined with additional human input to create a target system that is better constrained to the specific task.
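
To make the shape of that pipeline concrete, here is a minimal sketch of generate-then-filter. It is not the paper's code: every function below is a hypothetical toy stand-in (canned candidates instead of L_LM, a longest-word check instead of KeyBERT, a word-subset check instead of the NLI model), and the "output must be shorter" rule stands in for the human-provided task definition.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (source text, candidate output)
Filter = Callable[[Pair], bool]   # returns True if the pair should be kept

SOURCE = "the cat sat quietly on the warm windowsill"


def toy_generate(source: str) -> List[Pair]:
    """Hypothetical stand-in for L_LM: canned candidates, no real sampling."""
    outputs = [
        "the cat sat on the windowsill",
        "the dog sat on the windowsill",
        "the cat sat quietly on the warm windowsill",
        "the cat sat",
    ]
    return [(source, out) for out in outputs]


def toy_entails(pair: Pair) -> bool:
    """Toy stand-in for the NLI filter: every output word must occur in the source."""
    src, out = pair
    return set(out.split()) <= set(src.split())


def toy_keeps_keyword(pair: Pair) -> bool:
    """Toy stand-in for the keyword filter: the longest source word must survive."""
    src, out = pair
    keyword = max(src.split(), key=len)
    return keyword in out.split()


def toy_is_shorter(pair: Pair) -> bool:
    """Assumed task definition (summarization): the output must be shorter."""
    src, out = pair
    return len(out.split()) < len(src.split())


def filter_pairs(candidates: List[Pair], filters: List[Filter]) -> List[Pair]:
    """Keep only the pairs that every filter in the chain accepts."""
    return [pair for pair in candidates if all(f(pair) for f in filters)]


# The surviving pairs are what the target model (L_0) would be fine-tuned on.
dataset = filter_pairs(
    toy_generate(SOURCE),
    [toy_entails, toy_keeps_keyword, toy_is_shorter],
)
```

Swapping `toy_is_shorter` for a different rule changes the task without touching the generator, which is where the extra human input enters.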

We are paying for the "free lunch" in many ways (training corpora for the various models, human-provided task definition, etc.). If we iterate the process, the gains will diminish once we have fully incorporated the constraints from the various input models into the target model.


Although your weekly summaries are essential reading, this analysis is among the best I've read in ages. Keep up the fantastic work!


This clarifies the issue for me, thank you.

Domains for which we have coverage are all in the past. We are in the present. And we act in the future.

I’m open to pragmatic demonstrations of how machines might predict the future (that is, their future inputs). But how does the problem of induction relate to your conceptual explanation here?


Great post. I feel like it's hard to get a good sense of how useful training on synthetic data is, as many research works mix data smuggling (e.g. training on GPT-4 outputs) with task constraining.

While LLMs are the focus here, one really neat example of training from synthetic data is the InstructPix2Pix diffusion model. The authors showed that although diffusion models are not good at image editing out of the box, you can fine-tune them to do it using pairs of (original image + edit instructions, edited image). They create the example pairs synthetically with the original diffusion model, generating from paired captions with identical noise. The original model probably has enough knowledge to edit images, but the API used to pretrain it does not promote that behavior. You can still use the model to generate data that changes the task.
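
A rough sketch of the "paired captions, identical noise" idea, as I understand it. This is a toy, not the InstructPix2Pix code: `toy_generate` is a hypothetical stand-in for the text-to-image model, the "images" are short lists of integers, and the captions and instruction are made up for illustration.

```python
import random
from typing import Dict, List


def toy_generate(caption: str, noise: List[int]) -> List[int]:
    """Hypothetical stand-in for a text-to-image model.

    Deterministic in (caption, noise): the caption only shifts the noise.
    """
    shift = sum(ord(ch) for ch in caption) % 50
    return [value + shift for value in noise]


def make_example(caption: str, edited_caption: str,
                 instruction: str, seed: int) -> Dict[str, object]:
    """Build one (original image + instruction, edited image) training pair."""
    rng = random.Random(seed)
    noise = [rng.randint(0, 255) for _ in range(8)]

    # Both images come from the SAME noise, so they differ only because
    # the captions differ. That is what makes them a usable edit pair.
    original = toy_generate(caption, noise)
    edited = toy_generate(edited_caption, noise)

    return {"image": original, "instruction": instruction, "target": edited}


example = make_example(
    caption="a photo of a horse",
    edited_caption="a photo of a zebra",
    instruction="turn the horse into a zebra",
    seed=0,
)
```

The fine-tuned model then sees `image` plus `instruction` as input and `target` as output, a task the original model was never directly asked to do.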


I got a chuckle out of the first footnote. I would just slightly correct it to say that regularization improves *generalization*...
