What "research" entails day-to-day varies by field, subfield, problem, and individual researcher. However, there are clusters of project types and research tendencies that tend to go together.
Within computer science, I see three main research archetypes: Scientists, Mathematicians, and Inventors. Understanding which archetypes you tend towards and which your project falls under can help you choose what to work on, recognize what sorts of results you need to write a great paper, and avoid common pitfalls.
All mental models are wrong, but I hope you find this one useful.
Scientists
“Science” research strives to create new knowledge through experiments. “Science” papers often ask a question in the title or abstract and answer it with thorough experiments and analysis. They may also put forth a conjecture, such as the Lottery Ticket Hypothesis.
Good science papers:
Ask useful questions
Improve the reader’s understanding of some phenomenon, rather than merely reporting numbers. In the best case, the findings become an integral part of the reader’s mental models going forward.
E.g., ever since reading this paper, I’ve always thought of neural net loss landscapes as being subquadratic and made sense of results like the edge of stability in this light.
Make nuanced claims that don’t step outside what their experiments demonstrate
Highlight what can't be concluded from their results
Attempt to harmonize their results with related studies
Example 1: Chinchilla harmonizing with OpenAI scaling laws
Example 2: Scaling Laws for Fine-Grained Mixture of Experts harmonizing with Unified Scaling Laws for Routed Language Models
One “science” trick I learned from Jonathan Frankle and Mike Carbin is to choose experiments that are certain to be informative no matter the outcome, rather than betting on a particular outcome. E.g., when Zack was dynamically varying masking probabilities for BERT pretraining, the result would be interesting whether increasing or decreasing the masking rate was better (or if it didn’t matter at all, or if a fixed masking rate was the best, etc).
Few people write these papers, and those who do tend to have above-average experience and experimental resources, making the average quality high. However, asking a useless question, failing to control experiments properly, or failing to convey the circumstances under which the results hold can all render these papers useless.
These papers can also be harder to justify in industry because they don’t move any metrics. They rely on being informative enough that they help you design interventions that do move the metrics.
If you want one-stop-shops of good science papers in machine learning, look at stuff from Behnam Neyshabur, Sara Hooker, and Jonathan Frankle.
Mathematicians
Many papers focus on proving theorems.
What’s great about these papers is that their results are the strongest and most trustworthy. They’re not just usually true, or true in some poorly-defined circumstance; unless there’s a flaw in the proofs, they are universally, unequivocally true. They’re also likely to be timeless, since their dependencies are standard mathematical machinery, rather than particular software.
The bad news is that these papers are the least likely to be useful in practice. While theoretical guarantees can be necessary in domains like differential privacy and causal inference, a theory paper’s utility typically lies in serving as scaffolding for practical algorithms. E.g., one should almost never embed vectors with random projections, but the Johnson–Lindenstrauss lemma gives us a floor for how well fancier embedding methods should work.
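To make the scaffolding point concrete, here’s a minimal NumPy sketch of the random-projection construction the Johnson–Lindenstrauss lemma analyzes. The dimensions and seed are arbitrary; this is an illustration, not something you’d ship:

```python
# Project high-dimensional points with a scaled Gaussian matrix and check
# that a pairwise Euclidean distance is roughly preserved (JL-style).
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10_000, 512                 # points, original dim, projected dim

X = rng.normal(size=(n, d))                # toy data
P = rng.normal(size=(d, k)) / np.sqrt(k)   # Gaussian projection, scaled
Y = X @ P

i, j = rng.choice(n, size=2, replace=False)
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(f"original distance {orig:.1f}, projected {proj:.1f}, ratio {proj / orig:.3f}")
# The ratio concentrates around 1 as k grows; a fancier embedding method
# should preserve whatever structure it cares about at least this well.
```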
Math research tends to have the highest skill barrier and be heavily tied to some mix of intelligence and thousands of hours of math experience. At least from my non-theoretician perspective, it also helps your impact a lot to choose problems of practical relevance and explain your derivations well.
My personal favorite math papers are Restricted Strong Convexity Implies Weak Submodularity and Locality-Sensitive Hashing Using Stable Distributions, both of which were approachable enough to teach me new ideas while also proving strong scaffolding results for important problems.
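For a taste of the second paper, here’s a minimal sketch of the hash family it analyzes for Euclidean distance: h(v) = floor((a·v + b) / w), where a is drawn from a 2-stable (Gaussian) distribution and b is uniform in [0, w). The parameters below are illustrative, and a real index would concatenate many such hashes across multiple tables:

```python
# One hash function from the p-stable LSH family for Euclidean distance:
# nearby points usually share a bucket; distant points usually don't.
import numpy as np

rng = np.random.default_rng(0)
d, w = 128, 4.0                        # input dimension, bucket width

a = rng.normal(size=d)                 # 2-stable (Gaussian) projection direction
b = rng.uniform(0.0, w)                # random offset in [0, w)

def lsh_hash(v: np.ndarray) -> int:
    return int(np.floor((a @ v + b) / w))

x = rng.normal(size=d)
y = x + 0.01 * rng.normal(size=d)      # a near neighbor of x
z = rng.normal(size=d)                 # an unrelated point

print(lsh_hash(x), lsh_hash(y), lsh_hash(z))
# x and y almost always collide; x and z usually land in different buckets.
```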
Invention
In “invention” projects, we try to make numbers go up and/or to achieve falsifiable properties like fault tolerance. “Invention” papers feature tables and plots wherein one’s proposed method has higher numbers than baselines.
Let’s start with the bad news about this category.
First, most of the claimed improvements are just products of multiple hypothesis testing, graduate student descent, or other confounding factors. Of those that reproduce, many merely appear better thanks to insufficient measurement—e.g., saving FLOPs but not wall time.1
These not-so-useful papers tend to cluster around a few topics with low knowledge and implementation barriers, creating a vicious cycle in which:
There’s so much overcrowding that no reviewer can expect comparisons against more than a tiny fraction of existing approaches, which
lowers the barrier to publication, which
worsens the overcrowding.
This can also stop progress on the problem, which keeps barriers to entry low. See, e.g., the sad state of neural network pruning and my big list of discouraging meta-analyses.
However, the best "invention" papers introduce practical methods that are widely adopted. These few successful papers are, to a large extent, what justify the entire research enterprise.
Some examples of these methods in deep learning are:
The Chinchilla scaling rules
If you want a lot of examples of how to write these papers, Eamonn Keogh’s papers up through ~2014 are the best source I know of. This one might be my favorite paper ever. In deep learning, Noam Shazeer is probably the GOAT.
Invention Subtypes: Edison vs Tesla
I’ve spent the most time thinking about how to be effective at Inventor-style research. Different people have different research styles here, but I’ve found that much of the variation reduces to a single spectrum: Edison vs Tesla. Oversimplifying2 to highlight the contrast, we have:
Edison
Edison was a rapid-fire experimentalist. He seemed to be literally addicted to the variable-ratio reinforcement of trying out different ideas at his lab bench until one worked. He would go until exhaustion forced him to nap under his desk, and he wouldn't return home to his family for weeks at a time.
He didn't sit and derive how to build the light bulb on a whiteboard. He didn't wait for materials science to become better understood. He just tried every plausible substance as a filament until something worked.
That's not to say there was no intuition or understanding behind his process—just that his approach was to forge ahead at top speed with whatever knowledge was already available or discovered along the way.
In empirical deep learning research, this looks like: constantly seeking more compute, making sure your GPUs never sit idle, building tooling to run huge sweeps, disregarding theory papers, and thinking of velocity in terms of experiments run.
Tesla
Tesla was, in some ways, the opposite. He was known to design entire machines in his mind's eye, down to the last screw.
He had high conviction in what was possible and how to do it ahead of time. His predictions weren't always correct, but were sometimes prophetic—e.g., he identified that, with a big enough tower, he could transmit text, sound, and even stock prices all over the world in real time.
In empirical machine learning research, this often looks like systems research—you have some insight about an inefficiency, assumption, or empirical phenomenon that others haven't spotted, you design the core of the system/algorithm before writing any code, and you labor for months or years to make your system real. There's an experimental component, but you roughly know what the results will be based on your detailed analysis of the workload and hardware. It’s implementation, not experimentation, that’s the bottleneck.3
The most extreme example of this style I’ve experienced was with Multiplying Matrices Without Multiplying—I spent years thinking about how to do this, and the moment I had the core insight about how to do the tree-based encoding, it was clear it would be a win. There was ~0 uncertainty aside from the exact distortion numbers—it took months to build, but it immediately worked, with no experiments needed along the way.
For more examples of this style, I’d check out papers about optimized kernels or distributed deep learning, such as those from Daniel Lemire or the DeepSpeed Team.
Conclusion
Many great papers (and researchers) have elements of all three archetypes. And most impactful projects have a mix of rapid experimentation and fundamental insights. But hopefully this breakdown helps you think about what you most enjoy, where your strengths lie, and how to maximize the impact of your work.
P.S.: My analysis and examples are heavily overfit to deep learning, so I’d be curious to hear comments on the extent to which all this applies to other fields. Or just constructive criticism in general.
1. See, e.g., our discussion and experiments showing how easily FLOPs and time can diverge here. tl;dr, there are many possible bottlenecks in your system besides the compute itself, most of which stem from moving data around. See also the roofline model.
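As a toy NumPy illustration: the single matmul and the matrix-vector loop below perform the same number of floating-point operations (2n^3 each), but the loop has far lower arithmetic intensity, streaming the matrix over and over instead of reusing it, and is typically several times slower. Exact numbers will vary with your machine and BLAS build:

```python
# Same FLOP count, very different wall time: one matmul vs. n matvecs.
import time
import numpy as np

n = 2048
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n)).astype(np.float32)
B = rng.normal(size=(n, n)).astype(np.float32)
_ = A @ B                              # warm up BLAS threads and caches

t0 = time.perf_counter()
_ = A @ B                              # one compute-bound matmul: 2*n^3 FLOPs
t_matmul = time.perf_counter() - t0

t0 = time.perf_counter()
for i in range(n):                     # n memory-bound matvecs: also 2*n^3 FLOPs
    _ = A @ B[:, i]
t_matvec = time.perf_counter() - t0

print(f"matmul: {t_matmul * 1e3:.1f} ms   matvec loop: {t_matvec * 1e3:.1f} ms")
```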
2. I’m basing this subsection on some podcast episodes and YouTube videos and can’t be bothered to track down citations for the individual claims. The safest reading of my text is as a mythologized narrative designed to illustrate an idea.
3. To be clear, Tesla also ran a lot of experiments. It’s more that the tendency to design the whole machine in his head first makes him a good contrast to pure experimentalism.