This newsletter made possible by MosaicML.
Retentive Network: A Successor to Transformer for Large Language Models
They introduce an exceptionally promising attention variant.
Basically, they:
Ditch the softmax
Let each token attend only to a vector of state, instead of all previous tokens
Do LayerNorm separately on each head
Exponentially decay the attention with respect the sequence dimension, with a different decay coefficient for each head
This lets them compute attention efficiently online, in parallel, or in chunks.
This is similar to just having an RNN, using state space models, linear attention, and more, but isn’t quite the same as any of them.
Crucially, because the output for each token depends only on a vector summarizing the past, instead of the full history of past keys and values, you have no KV cache and get O(1) generations with respect to sequence length.
As a result, their attention variant lets you use way less memory and generate tokens much faster for large sequence lengths.
Now it’s not that hard to devise a token mixing scheme that runs faster than regular attention. What’s hard is doing it without losing accuracy.
So the surprising part here is that this scheme apparently improves both perplexity and downstream task performance.
It also scales better with model size, at least when looking at some {1.3B, 2.7B, 6.7B} models.
These are remarkably good results. If others can replicate them, this approach could totally become a new standard practice.
How is ChatGPT's behavior changing over time?
OpenAI’s APIs have changed in quality significantly in the past few months. In many cases, GPT-4 got worse while GPT-3.5 got better.
GPT-4 also seems to have gotten more hesitant to answer questions while GPT-3 got less hesitant. This means that some prompts that previously worked later stopped working.
GPT-4 often got more concise as well. This makes sense from a resource usage perspective but not really from a revenue one (with per-token billing).
For some reason, both models got way worse at generating executable code. Although often the code is just commented out, rather than outright incorrect. Update: The twittersphere figured out that postprocessing to account for this might be enough to reverse the GPT-4 degradation. Similarly, the GPT-4 primality numbers are partly a result of trading off true positives vs true negatives differently.
Since we don’t know what model changes yielded these output changes, this is more interesting from a business perspective than a technical one.
Mostly this finding reinforces my view that third-party AI APIs and organization-specific models are nearly disjoint markets. Like, shipping your data to an expensive third-party API that might start refusing to answer your query after a minor version update is…not an ideal offering for many companies.
*But* these APIs are super convenient, great for prototyping AI features, and about as accurate as you’re going to get if you only have enough data for a few in-context examples. They’re also great if you’re a consumer who wants a wide range of different queries answered; realistically, I’m just gonna type everything into ChatGPT rather than go find specialized apps for different purposes.
Basically, this is consistent with the worldview that tasks with serious business value will be handled by in-house models, while the long tail of lower-value queries will be fed to third-party APIs or open source models.
(EDIT) Also, just to be explicit: I’m not trying throw shade at OpenAI here. I see this largely as an intrinsic limitation of general-purpose APIs—you can make it better on average, but you can’t simultaneously avoid regressions across all possible use cases.
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
When you ask a text model to generate an explanation for its answer, you’d hope that its future responses are consistent with that explanation. E.g., suppose it says a bacon sandwich is hard to get in a certain area because bacon is hard to get there; if you then ask it whether bacon is hard to get there and is says “no”, there’s an inconsistency.
They constructed an eval pipeline to measure how often model explanations are inconsistent in this way. This pipeline:
generates other statements whose truth values should follow from the explanation,
asks the model about these statements, and then
checks how often what the model says corresponds to human-assigned truth values.
Turns out GPT-3 and GPT-4 often generate inconsistent explanations.
Also, how often a model generates a seemingly satisfying explanation doesn’t correlate with how often its explanations are consistent.
Insofar as this shows a gap between seeming good to humans and actually being good, this bodes poorly for deceptive alignment. It also shows that it’s still hard to build an accurate mental model of what LLMs are doing.
Variational Prediction
Normally to get our posterior predictive distribution, we factorize it into the posterior distribution over our latent variables and the conditional distribution of our test input given a particular latent variable values.
They propose to just learn the posterior predictive distribution directly. They do this by assuming a different graphical model of how the world works (Q) than the Bayesian one (P).
To get a useful objective, they define a variational upper bound that’s minimized when we align the distributions implied by these two graphical models.
They also show how to condition on some variables to predict others (e.g., predicting a class label from observed features).
They only show results on a toy problem and apparently encountered difficulties scaling the method up, but this is one of the most simple and interesting ideas I’ve seen in Bayesian stats in a long time.1 The main body of the paper is only six pages and it’s fairly approachable, so I’d definitely recommend it if you’re into probabilistic inference.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
New FlashAttention that gets 50-73% of peak FLOPS instead of 25-40% on an A100. Since it doesn’t change the math of attention, this is just a free win.
Scaling Laws for Imitation Learning in NetHack
They found clear power law scaling relationships for behavior cloning on NetHack. Even more interesting, they were using LSTMs rather than transformers.
We still don’t know why power law scaling keeps showing up, but this is evidence that it’s not specific to transformers and more evidence that it’s not specific to NLP.
It’s also surprising that they got such straightforward power laws given that at least one other RL paper didn’t unless they used specific eval metrics.
Efficient Guided Generation for Large Language Models
To make sure your LLM’s outputs match a regex or context-free grammar, you can mask out the logits for unacceptable tokens before each generation step.
But doing this naively can be slow; you’d have to iterate through 10k+ tokens and check if each one matches your rule. To make this faster, they build an index and finite state machine to quickly identify candidate subsets of tokens offline.
It’s a little unclear how much this helps from a speed perspective, but they do have a nice Python API:
Besides providing a solution for constrained decoding, this also made me think more deeply about how this is actually quite a hard problem.
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
When doing post-training quantization of LLMs, you should use fp8 and fp4 formats for your activations and (probably) your weights.
This is consistent with what you’d expect given the distributions of weights and activations. See, e.g., these histograms of the activations for different layers (columns) at different points in training (rows).
TinyTrain: Deep Neural Network Training at the Extreme Edge
They make on-device training work well for super resource-constrained devices by departing from the typical pretrain-finetune paradigm in a couple ways.
First, they add an extra meta-training step after the pretraining to try to increase the returns for their subsequent on-device few-shot learning.
Second, they use an objective based on Fisher Information, memory constraints, accuracy gain per-parameter, and accuracy gain per-MAC to select subsets of channels and layers to finetune.
Selecting task-specific channels using their method works better than doing generic channel pruning after the meta-training.
This channel-sparse training gets them large training speedups compared to full finetuning.
I’m surprised that they manage to exceed a 1.5x speedup vs full finetuning, since you’d expect to still have to do a full forward pass and backward pass (and would therefore, at best, reduce the wgrad time to zero). But it looks like maybe (?) they’re just pruning the non-selected channels outright, at least during training?
Their overall pipeline yields much better accuracies than various baselines, including more obvious approaches.
Both the meta-training and the task-specific channel selection help the accuracy a lot.
Besides practical on-device training being a big privacy win, this also makes me wonder whether we should just be adding a meta-training step after pretraining in general…
What do neural networks learn in image classification? A frequency shortcut perspective
Image classifiers often learn to recognize classes via patterns of frequency content. This can result in, e.g., stuff that looks nothing like a cat getting classified as a cat.
Improving Multimodal Datasets with Image Captioning
A common practice when cleaning (image, caption) datasets is throwing away pairs when the caption isn’t aligned well with the image according to a CLIP model. But it turns out that this often throws away perfectly good images that just had a bad caption.
This paper proposes to instead generate replacement captions with BLIP2 for pairs that have low image-caption alignment. If you filter both the raw captions and generated captions based on CLIP similarity, you can get a much better training set.
These improvements hold not just for image classification, but also for retrieval.
Part of this lift may be a result of BLIP2 captions and filtered captions being longer on average. This means we’d expect them to provide more supervision and/or not be garbage strings like “PRODUCT#0000007”.
It could also be that BLIP2-generated captions are just plain better than human-generated ones on average, at least when “better” is evaluated by CLIP.
The ability of BLIP2 + filtering to provide a mix of caption quality and diversity also seems to be a contributing factor.
Looks like an easy + large win for anyone constructing an (image, caption) data pipeline.
AlpaGasus: Training A Better Alpaca with Fewer Data
They find that they can get better instruction tuning by filtering the Alpaca dataset from 52k samples down to 9k.
Their filtering pipeline is automated and basically just asks ChatGPT to rate how good each sample is.
This apparently works, beating both the unfiltered dataset and random subsets of the same size.
On the one hand, the Alpaca dataset was generated by OpenAI’s text-davinci-003, the sample rating model here is ChatGPT, and the eval model is GPT-4—so there might be some overfitting involved. In particular, I would expect ChatGPT and GPT-4 to be trained on overlapping datasets, so the responses that these two models like are probably correlated.
But on the other hand, the basic result that using a smaller dataset works better than using a larger one (for some metric) is still interesting and supports the superficial alignment hypothesis.
Not that I follow this literature that closely
Interesting observation about Retentive Networks: https://twitter.com/ericzelikman/status/1682097753151660032?s=46&t=R1HcRy3wUpT5EYNQxGl8wg
They seem severely undertrained compared to other networks like Llama 2. Wondering if they just converge a little faster in the beginning of training and hence the favorable perf compared to regular Transformers.
There might be a few corrections to make on the "How is ChatGPT's behavior changing over time?" summary. Do not take this in bad faith, it's just that I read this newsletter and like it to stay true to the facts.
> In many cases, GPT-4 got worse while GPT-3.5 got better.
You might not be aware that claims of performance decreases seems to be misplaced, at least on the experiments investigated in the paper.
In particular, https://twitter.com/Si_Boehm/status/1681801371656536068 claims the LeetCode performance of the produced code got significantly better.
Similarly, https://twitter.com/tjade273/status/1682009691633614849 claims, depending in what setting you test primality detection on 5-digit numbers properly, the June version is either significantly better or about the same.
> OpenAI’s APIs have *quietly* changed in quality a lot in the past few months"
The paper investigates the difference between gpt-4-0314 and gpt-4-0613. The old version is to be supported until at least June 2024. Every OpenAI developer got an email introducing the new version.