The Scaling Exponent Hypothesis
Dario gestures at a trend that while expected cluster size improves at a fixed, predictable pace due to the bottlenecks of building new data centers and rate of hardware improvements, algorithmic progress moves at warp speed, allowing models of the same performance to be trained using less compute.
Connecting the datapoints tells us AI progress projections have historically underestimated algorithmic progress, which is now churning forth at a pace several times faster than even Moore’s Law.
Taking into account the bias to expect slow progress on this front, and applying a "debias" of anticipating faster progress to Dario's 2026-27 human-level AI timeline, I revise my AGI timelines to 60% by EoY 2025.
At present, this document is rough speculation on little data. (But so was Moore's Law). Please send me sources that argue against this – I promise I will read them.
Contents
1. Years Ahead but Way Behind
In "On DeepSeek and Export Controls", Dario Amodei reveals that their flagship model, Claude Sonnet 3.5, took on the order of "$10s of millions" to train, vastly cheaper than reports of GPT-4 ($50M — $200M), while also being 10 times cheaper to inference than its predecessor.
This data point rests on a curve of such cost improvements: The training input, between the time, money, compute, and data required to train large models is based on a set of algorithms (from self-attention to MLP blocks in transformers to optimizations in GPU kernels that can allow those to run faster at a hardware level), and the output of those algorithms (seen in the performance quality of the trained models) is getting better over time.
Per Dario's essay, little of this is public, but open source public improvements suggest an empirical curve. Compare training and inferencing a 1.5B parameter GPT-2 in 2019 to 2025, incorporating all these major improvements:
- Flash Attention reported a 3x improvement in training speed for 2019's GPT-2 in 2022.
- GPT-NeoX's parallel attention and feedforward networks
to-do: complete this math
                
                Dario notes down that algorithmic progress alone is plowing forth at a pace of 4x gains per year today, meaning a model that takes $400m to train right now will take $100m to train next year, and $25m the year after that – and this trend is somewhere between 1.5x to 4x year-on-year since 2020.
One of his sources, an Epoch AI progress report, analyzes algorithmic progress data from 2014 onward to propose an inverse Moore's law: the total compute required to reach a level of performance halves every 8 months. This report, published in March of 2024, would underestimate where Dario pegs today's progress (an 8 month halving time is ~2.7x per year efficiency gain). Another source Dario cites is a 2020 paper from OpenAI that found progress from 2012 to 2019 to "double every 16 months", albeit this document looked at performance on ImageNet, and EpochAI's report looked at language modeling.
Plots of upward progress like this do not hold very often, and less often do they survive four years to end up behind, dramatically underestimating where we would really end up.
A "meta" predictor factoring in a bias of slow algorithmic progress would suggest that their projected performance for a training run in the $10s of billions will arrive sooner than their own expectations. So if Dario projects a run in the “millions of chips” to result in models "smarter than almost all humans at almost all things” to take place between 2026 and 2027, our meta-Dario, debiased to factor in the second-derivative of AI progress, would predict a run producing such capability to complete even earlier.
This disrupts naive estimates of costs, and pulls into question well-informed ones based on the scaling laws of current models. What will take a 5GW datacenter now may take a sub-1GW datacenter in a few years. And if this continues, a training run of such performance may fall into the scope of existing clusters.
Even if the actors at play are factoring the pace of progress into account, they don't seem to be accounting for their own prediction errors. On this basis, I forecast general human-level AI by the end-of-year 2025. To be specific, I expect that by the end of this year, a system will exist that is capable of drop-in replacing an average remote-work software engineer, autonomously conduct AI research at the level of the average "brilliant PhD" on its own, and make new discoveries on the level of drop-out or skip connections. A critical threshold will be passed where all further developments become the domain of AI systems as they are, a shock-wave knocking down all obstacles. What happens next? To echo DeepMind's Shane Legg in 2009:
Due to greed, wishful thinking, ignorance and what have you, in general safety will come second to progress. A short period of time later the post human period will begin.
to-do: make more specific, falsifiable predictions
                my naive math: if 16 month doubling in 2020, to 8 month doubling in 2024, to 6 month doubling in 2025 – and we'd expect since anthropic was founded on the basis of a $10b run on GPT-3 estimates to be human-level ai –> let's say "doubles" in four years, let's say GPT-3 in 2020 = $10M training run and from this Anthropic projects a $10B training run for HLAI on a 5GW cluster, and by 2024 we're at an 8 month efficiency doubling, by 2025 it's 6 month, at some point, it gets 
 something something this makes me guess end of year 2025.
                Human-level AI is much more plausible as a near-term project in the $10s of billions on this trajectory, not a massive war effort. To put this into relative terms, Apple's R&D budget in 2022 was $27.65B alone!
2. Seeing With Eyes Unclouded
You must see with eyes unclouded by hate. See the good in that which is evil, and the evil in that which is good. Pledge yourself to neither side, but vow instead to preserve the balance that exists between the two.
–Princess Mononoke, Hayao Miyazaki
Algorithmic progress of autoregressive transformers isn't the only factor at play, what new RL techniques might be doing in a Gwern-style from the model's perspective
                Better coding models rapidly explodes the potential architectural search space at our fingertips – this is one actual mechanism by which I believe that "4x improvement" figures will stay on track
                Timeline estimates are difficult. Being wrong about specifics doesn't necessarily mean being wrong in spirit. In 2011, Shane Legg predicted a "proto-AGI" capable of learning primitive processing on its own by 2019, and this was arguably achieved by DeepMind's Gato in 2022. But Gato was just self-supervised learning on synthetic data from RL on video games! (Granted, stiched together with ViT – which wouldn't be out until 2020 – to enable image processing, and a few other things tossed in, but nothing all-too special). 2019's GPT-2 was a fire-alarm alert that unsupervised learning could scale, arguably vindicating the spirit of Legg's prediction – that a system would exist capable of learning advanced representations from data on its own.
I expect to see an impressive proto-AGI within the next 8 years. By this I mean a system with basic vision, basic sound processing, basic movement control, and basic language abilities, with all of these things being essentially learnt rather than preprogrammed. It will also be able to solve a range of simple problems, including novel ones.
–Goodbye 2011, hello 2012, Shane Legg
In the Scaling Hypothesis, Gwern reigns us back into historical perspective:
There is, however, a certain tone of voice the bien pensant all speak in, whose sound is the same whether right or wrong; a tone shared with many statements in January to March of this year; a tone we can also find in a 1940 Scientific American article authoritatively titled, “Don’t Worry—It Can’t Happen”, which advised the reader to not be concerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which certain scientists had stopped talking, raising public concerns; not only could it happen, the British bomb project had already begun, and 5 years later it did happen.)
–The Scaling Hypothesis, Gwern Branwen
The Scaling Hypothesis was prior to any RL on language models actually working. Not really even RLHF. With o1 and RL-on-chain-of-thought showing that the limits of scaling decoder transformers on more data are surmountable with newer architectures or training regimes, with Dario's prophecy that dark Satanic Mills of millions of GPUs may soon spur to life and that the projected performance of a training run on such hardware being underestimated time and time again by the discovery of better "scaling exponents", and Ilya Sutskever retreating to full research hermeticism, the intimations in Gwern's original 2020 post are rapidly reaching fruition. It all gives the impression of some "critical threshold" just out of reach. Whether this is achieved in a year, six months, tommorrow, or next week isn't of much relevance.
This section post should aim to identify the shadow of take-off looming, convincingly, with technical details about scaling exponents outside of unsupervised learning, such as RL on chain-of-thought
                Usually we are too wrapped up in our stories to catch even a glimpse of the gods. When we do, we can fight them, as Rachel Carson did after seeing disturbing trends in air and water pollution. Or we can ally with them, as Moore (and Kurzweil and Kaplan) did after seeing the exponential compute trends. But—ah, I hear the gods laughing again, in the face of these stories it’s always so tempting to tell. To the gods of straight lines, Carson and Moore did nothing, because the gods see (as we do not) the other timelines where the same insights came from other people, a month or a year or a decade delayed, but landing all the more powerfully because of it. The gods are intimately familiar with a fact that we can only hazily glimpse: that all great discoveries come in their natural time.
The Gods of Straight Lines, Richard Ngo
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (OpenAI, 2024a; Anthropic, 2024; Google, 2024), progressively diminishing the gap towards Artificial General Intelligence (AGI).
–DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, DeepSeek
Bibliography
Gwern Branwen on so-called "Scaling Exponents":
Well, Ilya would know better what OA was doing under Ilya that led to Q*/Strawberry, and what SI is doing under Ilya now, and how they are different... As I still don't know what the former is, it is difficult for me to say what the latter might be.
In RL, minor input differences can lead to large output differences, to a much greater extent than in regular DL, so it can be hard to say how similar two approaches 'really' are. I will note that it seems like OA no longer has much DRL talent these days - even Schulman is gone now, remember - so there may not be much fingerspitzengefühl for 'RL' beyond preference-learning the way there used to be. (After all, if this stuff was so easy, why would anyone be giving Ilya the big bucks?)
If you get the scaling right and get a better exponent, you can scale way past the competition. This happens regularly, and you shouldn't be too surprised if it happened again. Remember, before missing the Transformer boat, Google was way ahead of everyone with n-grams too, training the largest n-gram models for machine translation etc, but that didn't matter once RNNs started working with a much better exponent and even a grad student or academic could produce a competitive NMT; they had to restart with RNNs like everyone else. (Incidentally, recall what Sutskever started with...)
Stella Biderman on improvements to the vanilla transformer architecture:
There have been close to no improvements on the original transformer architecture; almost everything is a wash. The only real differences between current SOTA and the original paper are:
- You don’t have to use an encoder-decoder architecture, both decoder-only and encoder-only architectures are also useful. Different architecture are better at different tasks. Similar statements can be made about the training objective.
- There’s a major error in the paper Attention is All You Need where they accidentally put the layer norms after the layers not before them. The impacts are laid out very well in this paper. Note that the code for Attn is All You Need did it correctly, but nobody noticed and copied what is wrongly written in the paper.
- Ben Wang at EleutherAI figured out that you can put attention layers and MLPs in parallel. This doesn’t really effect performance but makes the model run much faster. This was first introduced in the GPT-J-6B model and first described in a paper by GPT-NeoX-20B: An Open-Source Autoregressive Language Model.
- The original positional embedding method is garbage. Basically anything else is better, but Rotary Positional Embeddings are currently considered the mainstream way to do it. They’re basically Sinusoidal embedding but done correctly lol.
That isn’t to say that there haven’t been some changes. Application paradigms and finetuning have especially changed… first domain-specific finetuning, then few shot prompting, then multitask finetuning and reinforcement learning from human feedback. But I do think it’s remarkable how much the OG paper got right and how little in the architecture has actually changed.
(This was posted prior to GPT-4. While GPT-1 to 3 were essentially scale-ups of the 2017 transformer decoder, GPT-4 (according to rumors) used a mixture of experts, a considerable architectural change. Additionally, longer contexts and inference have motivated changes in self-attention.
To-do: Expound on changes post-GPT-4