2023-7-23 arXiv roundup: OpenAI breaking…

Jul 25, 2023

This newsletter made possible by MosaicML.

7 Comments

Jul 31, 2023

Interesting observation about Retentive Networks: https://twitter.com/ericzelikman/status/1682097753151660032?s=46&t=R1HcRy3wUpT5EYNQxGl8wg

They seem severely undertrained compared to other networks like Llama 2. Wondering if they just converge a little faster in the beginning of training and hence the favorable perf compared to regular Transformers.

Daniel Paleka

Jul 25, 2023 (edited)

There might be a few corrections to make to the "How is ChatGPT's behavior changing over time?" summary. Please don't take this as bad faith; I read this newsletter and would like it to stay true to the facts.

> In many cases, GPT-4 got worse while GPT-3.5 got better.

You might not be aware that the claims of performance decreases seem to be misplaced, at least for the experiments investigated in the paper.

In particular, https://twitter.com/Si_Boehm/status/1681801371656536068 claims the LeetCode performance of the produced code got significantly better.

Similarly, https://twitter.com/tjade273/status/1682009691633614849 claims that, depending on the setting in which you properly test primality detection on 5-digit numbers, the June version is either significantly better or about the same.

> OpenAI’s APIs have *quietly* changed in quality a lot in the past few months

The paper investigates the difference between gpt-4-0314 and gpt-4-0613. The old version is to be supported until at least June 2024. Every OpenAI developer got an email introducing the new version.

Davis Blalock

Jul 26, 2023

Also, just to explicitly address the incentives: I'm much better off {maxing out credibility, being constrained to true reasons to use something like MosaicML} than {sacrificing some credibility, exaggerating reasons to use us}.

Among other things, this is a consequence of there legitimately being good reasons to use us. Also, in the OpenAI case specifically, I mostly don't see them as a competitor and would prefer not making enemies.

tl;dr I was worried about this looking *intentionally* incorrect all day after reading your comment and would love input on how to ensure it doesn't read that way.

Daniel Paleka

Jul 26, 2023

I did not intend to cause stress to you, and I apologize for not going through the email channel first.

For anyone reading this, *I very explicitly and honestly think this was an accidental and honest slip* by Davis Blalock. I think he's quite credible, and cannot reasonably be expected to follow up on all the criticism of any given work.

Wrt the paper itself: after a bit of thought, I have no way of knowing exactly which step failed in the process of producing the results -> writing the paper -> writing the title and abstract -> public announcement -> public reception. The average impression of the paper is not true to the results, but some version of that happens to everyone. Blame is not to be assigned lightly, nor indiscriminately, nor by association.

Wrt evaluation papers: evaluation has become much more difficult with recent models, to a degree that is not yet properly acknowledged in the research community. Prompting means we always underestimate the best possible performance. Memorization of similar tasks can never be excluded. Minor changes in formatting result in uncomfortably high variance. All honest evaluation paper numbers (incl. in papers I'm an author on) should be taken as very imprecise answers to the true question of "what can this model do". To say there are "methodology issues" with any given paper does not cut it: we are in the postmodern stage of LLM evaluation, where there is no set-in-stone methodology and no correct answer to "what are the capabilities of GPT-4".

Davis Blalock

Jul 26, 2023

Thanks for pointing this out! I don't scroll twitter much and hadn't seen these results. I've added caveats in my description with links to these tweets.

I also think you're right that my wording is too strong wrt "quietly". While one could describe unanticipated regressions in newer API versions as "quiet," my current prose doesn't do a good enough job of pointing out that there are fixed versions available to developers. As you highlight, this is pretty different from a given API version working differently out of nowhere. I've therefore updated the prose to (hopefully?) avoid implying the latter.

Let me know if you think there's anything else I should change. I definitely make mistakes and comments like this are what help me identify + correct them.
