With the way things are going, scaling advanced AI capabilities to the world will require growth in energy production capacity akin to a second industrial revolution.
Humanity urgently needs to invest heavily in developing radically different and more energy efficient AI technology.
AI may be astonishingly powerful, but it’s also astonishingly power-hungry.
A single query to a cutting-edge reasoning LLM can consume thousands to millions of times more energy than a familiar operation like a Google search.
Yet the collective dream is that soon everyone, everywhere, will have unlimited access to advanced AI. In the popular imagination, we’re on the verge of a Jetsons-style future where computers automate almost all of the boring stuff we do by hand today.
"Eventually we can each have a personal AI team, full of virtual experts in different areas, working together to create almost anything we can imagine."— Sam Altman
"I'm hoping that Nvidia someday will be a 50,000 employee company with a 100 million AI assistants in every single group."— Jensen Huang
But how realistic is that vision? How hard would it actually be to scale our most advanced AI models out of the lab and into the real world? Are we a few engineering cycles away, or a few fundamental breakthroughs?
To answer that, it helps to ask a simple, grounded question: what would it take to deliver these futures using today’s technology? How many solar panels would we need to give everyone access to current-generation LLMs running on current-generation hardware?
Because this question deals with things that already exist, it has a simple answer: bringing advanced AI to everyone using today’s tech would require scaling our global energy production by several orders of magnitude.
Surely it won’t come to that, right? AI technology is still very immature, and there’s probably low-hanging fruit left to make it more efficient.
Despite the hazards of weighing in as someone not working on LLMs, I'm going to argue for my own view: while some efficiency gains remain, they’re nowhere near enough to make the wide-scale deployment of advanced AI capabilities feasible on anything like today’s infrastructure.
That conclusion follows from two beliefs: GPUs are already near the frontier of efficiency, and model complexity grows with capability. The first is almost self-evident from physics; the second is confirmed by decades of computing history.
This doesn't mean that I think continuing to scale is a bad idea: even if our dreams are out of reach of our current approach, we will build a lot of useful things in the process of trying.
Rather, my view is that if we’re going to spend trillions building infrastructure for GPU-based LLMs, we should put equal effort into designing far more energy-efficient AI systems.
Moreover, to stand a serious chance at building more efficient AI systems within the next decade, we need to stay focused on what’s both physically buildable and theoretically sound. That means finding smarter ways to run existing AI algorithms on integrated circuits, not chasing unproven hardware (quantum, memristors…) or speculative architectures (spiking NNs, reservoir computers…) which will take decades to mature.
In the rest of this article I'll argue for my way of seeing things. First, I'll use a simple model of the energy demands of LLMs to analyze a few potential AI futures, and show that we would have to scale up our current power grid by orders of magnitude to realize them using today's tech. Then, I'll show that digital computers (like GPUs) haven't gotten substantially more efficient at the component level in the last decade. Finally, I'll wrap up by arguing that today's models are far less complex than they will be in the future.
The complexity of large-scale AI inference pipelines makes precisely calculating their energy consumption complicated. In general, the only way to get anything close to an exact estimate would be to very nearly build one for yourself.
Instead, I devised an order of magnitude Fermi estimate that gets to the essence of the answer without taking into account every excruciating detail.
My model uses facts about LLMs and GPUs to estimate how much energy it would take to support different AI usage scenarios using today's technology.
Specifically, my model estimates the energy $E$ required for an LLM system with $P$ parameters to process $N$ input/output tokens,
$$ E = 2 P M N \mathcal{E}$$
The parameter $M$ is a reasoning multiplier that takes into account the internal "thinking" tokens produced by a model that are never seen by a user. $2 P M N$ is roughly the total number of FLOPs used by the model to process $N$ IO tokens.[1] $\mathcal{E}$ quantifies the energetic cost of running the model on a particular piece of hardware, and has units of Energy/FLOP.
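For readers who prefer code to algebra, here is the whole model as a few lines of Python, applied to a single hypothetical query. Every input value below is an assumption chosen for illustration (from the parameter ranges discussed later in the article), not a measurement of any particular system.

```python
def inference_energy_joules(P, M, N, E_per_flop):
    """Energy to process N visible tokens with a P-parameter model.

    2 * P * M * N approximates the total FLOPs, including the M-fold
    overhead from internal "thinking" tokens; E_per_flop converts
    FLOPs to Joules for a particular piece of hardware.
    """
    return 2 * P * M * N * E_per_flop

# Illustrative single-query example (all values are assumptions):
P = 5e11            # ~500B active parameters
M = 25              # moderate reasoning overhead
N = 1_000           # visible input + output tokens
E_per_flop = 7e-13  # roughly H100-class Joules/FLOP (see the GPU table later on)

energy_j = inference_energy_joules(P, M, N, E_per_flop)
print(f"{energy_j / 3600:.1f} Wh per query")  # ~4.9 Wh with these numbers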
Let's put this simple model to work and analyze the energy demands of realizing a few different AI futures using today's technology.
First, let’s check that my model produces a reasonable output for a baseline scenario: how people use LLMs today. Most people chat with LLMs occasionally and use limited reasoning capabilities. My model predicts that this usage should account for around half a percent of the total power delivered by our grid on average [2], which is roughly in agreement with current estimates.
With that out of the way, let's move on to a basic AI-enabled future: an always-on text assistant for everyone. Although such an assistant would benefit from advanced reasoning capabilities, let's begin with one that's as intelligent as today's best chatbots.
This text assistant scenario is relatively easy to imagine from where we stand today, and roughly corresponds to the average person in the future using as much AI as today's extreme power users do. This is far simpler, and less computationally demanding, than "a team of virtual experts", for example.
Despite the apparent simplicity of this scenario, if we wanted to serve these text assistants to everyone using today's technology, we'd be faced with the daunting task of adding ~20% (or around 100 GW) to the average output of the grid.
Adding 20% to the grid sounds hard, but this is nothing compared to richer scenarios that are closer to our collective dreams.
For example, let's look at something one step more complicated: a video assistant that lives on your glasses and interacts with everything you see through a camera.
This video assistant could be much more useful than the text-only version; we get most of our information through what we see, and do a lot of thinking to convert it to text. However, at least the way we do things now, processing images is far more expensive than text.
My model predicts that using today’s tech to provide everyone with an always-on video assistant at low resolution and frame rate would require an order of magnitude scaling of the power grid (raising capacity to around 5 TW in total).
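As a sanity check on those two figures, here is the arithmetic behind them for one plausible set of parameters. The parameter choices (model size, reasoning multiplier, hardware efficiency, number of users, daily token loads) are my assumptions, picked from the ranges discussed later in the article, so treat the outputs as order-of-magnitude only.

```python
SECONDS_PER_DAY = 86_400
USERS = 8e9            # everyone on Earth
P = 2e11               # assumed active parameters
M = 25                 # chatbot-level reasoning overhead
E_PER_FLOP = 5e-13     # roughly B200-class Joules/FLOP

def average_power_watts(tokens_per_user_per_day):
    energy_per_user = 2 * P * M * tokens_per_user_per_day * E_PER_FLOP
    return energy_per_user * USERS / SECONDS_PER_DAY

# Daily token loads come from the assistant table later in the article.
print(f"text assistant: {average_power_watts(2e5) / 1e9:.0f} GW")     # ~100 GW
print(f"video @ 1 FPS:  {average_power_watts(1.1e7) / 1e12:.1f} TW")  # ~5 TW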
These scenarios make it clear that we would need to make major changes to the world to provide everyone with an AI system that helped them deal with even a small fraction of the information they process every day.
Along with increased throughput, the energy that an AI system uses increases dramatically with the required level of reasoning capability.
Today, the main way we make an AI system smarter is through test-time compute scaling. Scaling test-time compute means allowing the model to use more and more neural network evaluations to come up with the answer to a question. In our model, the effect of test-time compute scaling is captured by a much larger value of the reasoning multiplier $M$ compared to today's chatbots.
Running highly intelligent AI systems based on current test-time compute scaling is shockingly computationally intensive.
For example, if we endow our text assistant with basic reasoning capabilities, the energy consumption rises to nearly 250 TW, around 50 times what today's grid supplies on average. I modeled "basic reasoning" on the current highest-performing model on the ARC-AGI 2 benchmark [3]. Although ARC-AGI still challenges today's AI models, it requires much simpler reasoning abilities than most problems that the average person encounters day-to-day. As such, a text assistant with this level of reasoning capability would be substantially less helpful than a human assistant.
The numbers only get more extreme as we demand increased intelligence from our system.
Unless an AI system that is many orders of magnitude more efficient is developed, giving everyone even extremely limited access to elite-level reasoning systems would be immensely resource intensive. My model predicts that serving every individual a few queries a day to today's AI systems that win gold medals at the IMO math olympiad [4] would require the power grid to be enlarged by more than a factor of 100.
The situation gets even more absurd if we demand increased access to elite-level AI models, which we would if we expected to interact with them like we do our human coworkers. If we each wanted a full-time assistant with this elite level of capability, we would have to roughly 2000x our power grid. This would bring the total capacity of the grid up to over 1000 TW, a completely unimaginable number by today's standards.
Scaling today's tech to bring advanced AI to everyone will require a power grid that is orders of magnitude larger than what we have today.
If we choose to take this path, it will be by far the most complicated and expensive infrastructure project humans have ever attempted.
Yet this brute-force scaling seems to be plan A:
"Our vision is simple: we want to create a factory that can produce a gigawatt of new AI infrastructure every week."— Sam Altman
Grid capacity in the US has been increasing roughly linearly since the 1950s at a rate of around 0.3 GW/week [5], which means we would have to build power plants 3-4 times faster than we do now to meet the 1 GW/week objective. There is very little reason to believe that this is easy or even possible; energy production is the ultimate bottleneck to all technological progress (not just AI), so we are probably already going about as fast as we can given the constraints of how our society works.
Even if we pulled off the large-scale industrial re-organization required to build a gigawatt of new AI infrastructure every week, the cost of doing so would be astronomical. Just building enough solar panels to support all of this compute would cost around 100 billion USD/year [6], not to mention the rest of the datacenter, which would probably bring the bill up to closer to a trillion USD per year in total. For comparison, the entire US interstate highway system was built for around 500 billion USD, and was completed over the course of half a century [7].
This all sounds very difficult, but humans are resourceful, and we could probably pull it off if we all decided it was worthwhile.
Unfortunately, based on my calculations, 1GW per week is not nearly enough to bring advanced AI to everyone.
We just showed that, based on today’s technology, a more realistic power figure for unlocking even the simplest of the aforementioned AI futures is (charitably) something like 10 TW. Even spread over two decades, building 10 TW of compute is in the realm of far-flung sci-fi.
To build 10 TW of compute over two decades, we would need to build power production infrastructure ~30 times faster than we do today. If we chose to provide all of this power using solar panels similar to the ones that exist today, they would take up roughly the area of Nevada and cost about 20 trillion USD.
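The Nevada and 20-trillion-dollar figures fall out of a back-of-envelope calculation like the one below. Every input (capacity factor, insolation, panel efficiency, module cost) is a round-number assumption on my part, so the outputs are only meant to show that the order of magnitude holds up.

```python
avg_power_w = 10e12         # 10 TW of average delivered power
capacity_factor = 0.20      # assumed solar capacity factor
insolation_w_per_m2 = 200   # assumed time-averaged sunlight on the panel
panel_efficiency = 0.20     # assumed module efficiency
usd_per_peak_watt = 0.40    # assumed module-only cost

peak_capacity_w = avg_power_w / capacity_factor                    # ~50 TW of panels
panel_area_m2 = avg_power_w / (insolation_w_per_m2 * panel_efficiency)
cost_usd = peak_capacity_w * usd_per_peak_watt

print(f"panel area ~ {panel_area_m2 / 1e9:.0f} thousand km^2")  # Nevada is ~286k km^2
print(f"module cost ~ ${cost_usd / 1e12:.0f} trillion")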
An optimist may still try to argue that the standard incremental march of technological progress will bring this cost down. And this may very well be true; progress in technology is generally exponential, and exponential curves have a way of sneaking up on you.
However, we are at the end of the exponential improvement of the efficiency of digital computers, and AI models are getting exponentially larger over time, not smaller.
Historically, the energy demands of computing have been significantly curtailed by computers getting more energy efficient. For a long time, logic gates, the building blocks of computers, were increasing in efficiency at a consistent and exponential rate.
However, logic gates haven't gotten much more energy efficient in the last decade, and GPU efficiency improvements have largely been driven by architectural refinement efforts that are nearly tapped out.
Barring breakthroughs in materials science or advanced manufacturing, which would take decades to commercialize anyway, there's no obvious way for GPUs (as we know them today) to use orders of magnitude less energy.
There are good fundamental physical reasons why logic gates made out of transistors aren't getting more efficient. Much of the energy in an integrated circuit processor is spent charging and discharging the electrical conductors that make up the transistors themselves and the wires that connect them:
$$ E = \frac{1}{2} C V^2$$
$E$ is the energy required to charge the conductor. $C$ represents the capacitance of the conductor, which depends in a complex way on how the logic gates are manufactured, and $V$ is the voltage level used to send signals around the computer.
Despite the fact that more and more advanced transistor processes continue to be developed, both capacitance and voltage have largely plateaued. Capacitance has stopped decreasing because it is increasingly limited by parasitics, i.e., the wires and other supporting structures on the chip rather than the transistors themselves. Voltage has stopped decreasing because if it's made any smaller, the transistors will stop behaving like transistors due to Boltzmann's tyranny.
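To get a feel for the scale involved (using round numbers rather than the specs of any particular process node): a logic node with on the order of a femtofarad of switched capacitance, driven at 0.7 V, costs

$$ E = \frac{1}{2} \left(10^{-15}\,\text{F}\right)\left(0.7\,\text{V}\right)^2 \approx 2.5 \times 10^{-16}\,\text{J} $$

per charge/discharge cycle. A single 16-bit FLOP on a GPU involves thousands of such events across logic and wires, which is consistent with the roughly picojoule-per-FLOP figures in the hardware table later in the article.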
Due to the halt in progress at the component level, enormous effort has been put into making GPUs better for machine learning by refining their architecture. This includes adding hardware support for low-precision arithmetic, expanding tensor cores, and optimizing on-chip memory and interconnects.
The natural conclusion of the architectural refinement of GPUs is complete specialization for LLM workloads, which we already have today in the form of ASICs.
These ASICs are generally less than an order of magnitude more efficient than the latest GPUs, which suggests a rough ceiling on how much more efficient future GPUs can become given the halt of progress at the component level.
Due to the relative technological immaturity of LLMs compared to logic gates, it's much harder to predict how their efficiency will change in the future.
We are still actively developing techniques that allow less computationally intensive models to be smarter, such as reducing model size via distillation or shortcutting internal chains of thought.
However, the unfortunate reality is that even the most far-reaching futures considered here are likely much simpler than where things are actually going. Eventually we are going to want AI systems to handle almost all of the mundane work that is currently done by humans, and many of these tasks can require intelligence far beyond any AI system that exists today.
For example, although scoring well at a math olympiad is impressive and demonstrates substantial reasoning capabilities, building a system that works in a sterile environment is very far from building something that can act in the messy real world. In the real world, you don't just have to think about things; you are also constantly making decisions based on your reasoning and then using your body to implement them. Motor control is famously far harder for machines than higher reasoning (see Moravec's paradox).
My guess is that despite improvements in computational efficiency, future AI models that can achieve the ultimate objective of automating a large fraction of human labour will be far more computationally intensive than what we have today.
From my perspective, the path forward is clear: to bring advanced AI to the world at large, we need to build far more energy-efficient computing systems than we have today.
If we want to scale fast, the only practical option is to make better use of the technology we already have: integrated circuits. Alternatives like quantum, photonic, or spintronic computing may one day deliver massive efficiency gains, but they'll remain stuck in the lab for years while they mature. Given what we've just seen, the opportunity cost of waiting that long is intolerable.
To make integrated circuits more efficient, we have to tackle the charging-energy problem head-on. Luckily, there are several ways that this can be solved that don’t rely on miracle breakthroughs in materials science.
For example, digital logic can be redesigned to operate at far lower voltages using subthreshold circuits, which trade speed for energy savings that can reach orders of magnitude. Cryogenic computing can push efficiency even further by changing the thermodynamic constants that relate on-voltage and leakage. Reversible computing aims to reduce the wastefulness of digital logic by recycling the energy used to charge and discharge wires (like regenerative braking for computation), an approach already demonstrated at small scales in standard semiconductor processes.
Further gains could come from going beyond digital. Since most AI models already operate at low precision, specialized analog accelerators could exploit physical effects to run parts of a model with extreme efficiency. Analog is hard because it’s less tolerant to variation than digital, but the skyrocketing energy demands of AI likely make it worthwhile.
Moreover, co-designing AI algorithms and hardware (instead of largely developing them separately like we do today) should lead to the discovery of completely different architectures that are far more hardware efficient than LLMs running on GPUs.
Realistically, a solution to the AI energy crisis will involve building a system that combines many of these approaches, and I think a lot more people should be working on this.
Because we HAVE NOT solved all the hard problems in AI; in fact, our work has really just begun.
Today’s most capable AI systems are predominantly autoregressive large language models (LLMs) based on transformer neural networks. For most LLMs, the number of FLOPs required to run inference is well-approximated by the simple formula,
$$ C = 2 P n $$
where $P$ is the size of the model, i.e., the number of parameters that are typically active during inference, and $n$ is the amount of information seen by the model during the inference run, i.e., the total number of tokens the model processed (both inputs and outputs).
This simple picture is complicated by the fact that most generally useful AI models today incorporate a large amount of reasoning, which involves generating internal “thinking” tokens that are never seen by the user. We will incorporate this into our model by introducing the reasoning multiplier $M$, such that
$$ n = M N $$
where $N$ is the number of “visible” tokens seen by the user, i.e., the inputs and outputs of the LLM. An $M$ value of 1 means that the model does no reasoning (like the original GPT-4, for example) and therefore generates no internal tokens. Larger $M$ values mean that the model is thinking for a longer time before it gives you an answer.
Details about the type of computer being used to run the model can be used to connect FLOP counts to the energy consumption $E$. For the sake of our Fermi estimate, we will stick to a simple model of the hardware,
$$ E = C \mathcal{E} $$
Here, $\mathcal{E}$ is a parameter that represents the theoretical maximum efficiency of the hardware that is running the model. It has units of [Energy/FLOP], and can be calculated from information provided in the manufacturer’s datasheets. This model underestimates the energy consumption of a real system.
These basic parts are combined to form the equation used in our analysis.
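Explicitly, chaining the pieces together recovers the formula from the main text:

$$ E = C \mathcal{E} = 2 P n \mathcal{E} = 2 P M N \mathcal{E} $$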
Each of the parameters in our power model strongly influences its predictions. Some of the parameters are easy to pin down (within an order of magnitude, at least) using well known facts about today’s technology. The others can take on a range of values that depend on how the AI system is being used.
Let’s start with an easy one: the size parameter $P$. Although the exact parameter counts of the AI models people use every day are largely kept secret, it is generally understood that most of the powerful LLMs released in the last few years have between $10^{11}$ and $10^{12}$ parameters active at inference time. Note that this is not necessarily the total number of parameters in the model, which can be much larger in a mixture-of-experts system (like Google’s Gemini 2.5, for example).
It is also straightforward to estimate the energy parameter $\mathcal{E}$ from the accelerator manufacturer’s specifications. For example, NVIDIA generally specifies both a peak power and FLOPs/second for their GPUs, the ratio of which gives $\mathcal{E}$. The table below shows values computed in this way for a few different NVIDIA GPUs, assuming the common 16-bit floating point format.
| GPU | $\mathcal{E}$ [Joules/FLOP] |
|---|---|
| RTX 4090 | $1.4 \times 10^{-12}$ |
| A100 | $1.3 \times 10^{-12}$ |
| H100 | $7.1 \times 10^{-13}$ |
| B200 | $4.8 \times 10^{-13}$ |
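As a concrete example of how this table was produced, the sketch below divides a datasheet power figure by a datasheet throughput figure. The spec values I've plugged in (400 W and 312 TFLOPS for the A100, 700 W and ~989 TFLOPS for the H100, both dense FP16 tensor throughput) are the commonly quoted datasheet numbers; if a different spec variant is used, the ratio shifts accordingly.

```python
def joules_per_flop(peak_power_watts, peak_flops_per_second):
    """Theoretical best-case energy per FLOP from datasheet specs."""
    return peak_power_watts / peak_flops_per_second

# Commonly quoted dense FP16 tensor specs (treat as approximate):
print(f"A100: {joules_per_flop(400, 312e12):.1e} J/FLOP")  # ~1.3e-12
print(f"H100: {joules_per_flop(700, 989e12):.1e} J/FLOP")  # ~7.1e-13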
On the other hand, the reasoning multiplier M can vary significantly depending on the specifics of the model being used and the problem that is being solved.
There are several different ways that reasoning happens in LLM systems. In general, reasoning capabilities can be imbued into a model during training by encouraging it to traverse all of the steps that lead to solving a problem explicitly instead of jumping right to the answer. Alternatively, for special problems where it is easy to score how “good” an answer is (such as math), various search-like algorithms can be used to repeatedly query a model to produce progressively better answers. The former is more relevant to general purpose AI applications (like Q&A) and the latter is more relevant to “agentic” scenarios in which a model is trying to solve a complicated problem with an easily verifiable solution.
Commercial general purpose LLM systems do not expose their reasoning tokens, but we can still get a sense of the range of typical M values by studying the amount charged by providers to query different tiers of models. This isn’t an exact method because compute utilization isn’t perfectly correlated with the cost to the user (as providers can choose to make or lose a different amount of money on each different type of model if they want). However, for the sake of our order of magnitude estimate this should be a reasonable starting point.
Taking the example of OpenAI, there is a clear cost hierarchy between the simplest (nano) and the most complex (pro) models. If we assume that a nano model performs no reasoning (which may or may not be true), we can use this cost hierarchy to roughly establish a corresponding reasoning hierarchy, which is summarized in the below table.
| GPT 5 Model | M |
|---|---|
| Nano | 1 |
| Mini | 5 |
| Standard | 25 |
| Pro | 200 |
From this, we can conclude that M values between 10 and 100 are typical for the things people commonly use AI for today.
It’s not clear how large M will have to be to make the jump from the relatively simple capabilities of today’s models to something that could reliably navigate the virtual (or even physical) world autonomously.
What we do know is that M can get extremely large when LLMs are used to solve difficult problems.
For example, it is straightforward to estimate that the current highest-performing systems on the ARC-AGI 2 benchmark use an M value in the $10^{4}$ - $10^6$ range. In particular, we can use the reported per-problem cost of the current highest-performing system (J. Berman, 2025), along with facts about the cost of electricity and the efficiency of GPUs, to come up with an estimate of the total number of FLOPs used. Once we have an estimate for the number of FLOPs used, we use it to back out an estimate for M, given a rough idea of the size of the model and the number of input/output tokens associated with each problem. For the particular case of J. Berman's system, we find that M was roughly $5 \times 10^5$.
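Here is roughly what that back-out looks like in code. Every input below (the per-task spend, electricity price, model size, and visible token count) is a placeholder I've chosen for illustration rather than a figure reported for Berman's system, and treating the whole spend as electricity is deliberately crude; the point is only that reasonable inputs land M in the quoted $10^4$–$10^6$ range.

```python
# Placeholder inputs, chosen for illustration only:
cost_per_task_usd = 30.0        # assumed per-problem spend
electricity_usd_per_kwh = 0.10  # assumed electricity price
E_PER_FLOP = 7.1e-13            # H100-class efficiency from the table above
P = 5e11                        # assumed active parameters
N_visible = 1e4                 # assumed visible tokens per problem

# Treat the entire spend as electricity, convert to Joules, then to FLOPs.
energy_j = (cost_per_task_usd / electricity_usd_per_kwh) * 3.6e6
flops = energy_j / E_PER_FLOP

# Invert C = 2 * P * M * N to recover the reasoning multiplier.
M = flops / (2 * P * N_visible)
print(f"M ~ {M:.0e}")  # ~2e5 with these placeholders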
Estimating M for the IMO gold medal systems is more complex because nothing was published about how they worked. For example, we know that OpenAI's system worked for four and a half hours to solve six problems, but we don't know how many GPUs were used to run the models. OpenAI regularly spins up hundreds of thousands of GPUs for large training runs, so it's fair to think that they were motivated to spin up at least 100 for the IMO, given the amount of positive publicity the result gave them. Given 100 GPUs, we can calculate $M \approx 2.5 \times 10^6$, which is the number we used in our analysis. In reality, OpenAI probably used many more than 100 GPUs (because why wouldn't they), which would inflate our energy projections even further.
Another parameter that strongly depends on the context in which the model is being used is $N$, the number of tokens that have to be processed by the AI system in some given time window. Here, we will consider a specific scenario to help make this concrete.
In particular, let’s examine the most basic vision for an AI-enabled future that has been ubiquitously advertised by Silicon Valley CEOs: always-on AI virtual assistants for everyone. A simple version of such a system might interact with the text that an individual sees every day, which on average is around 100k tokens. A more advanced version might also interact with some of what we see (since this is how humans get most of their information). While multi-modal AI systems are a bit more complicated than what our simple FLOP-counting model can handle, we can approximate the additional computational burden imposed by images by looking at how images are billed in LLM APIs. For example, users of the Gemini API pay a few hundred tokens for every small image they pass into a model. The table below summarizes the total token load for a few different versions of the AI assistant.
| Assistant Type | N/day |
|---|---|
| Text only | $2 \times 10^{5}$ |
| Video @ 1 FPS | $1.1 \times 10^7$ |
| Video @ 5 FPS | $5.5 \times 10^7$ |
In the text assistant case, we have assumed an equal number of input and output tokens, whereas for the video assistants we have assumed the visible token load is input-dominated.
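For completeness, here is how token loads like these can be reconstructed (my reconstruction, not necessarily the exact numbers behind the table): a few hundred tokens billed per small image (Gemini-style billing, roughly 258 tokens), a camera running for about 12 waking hours, and the stated frame rate.

```python
TOKENS_PER_FRAME = 258  # assumed billing for one small image, Gemini-style
HOURS_ACTIVE = 12       # assumed hours per day the camera is running

def video_tokens_per_day(fps):
    return TOKENS_PER_FRAME * fps * HOURS_ACTIVE * 3600

print(f"1 FPS: {video_tokens_per_day(1):.1e} tokens/day")  # ~1.1e7
print(f"5 FPS: {video_tokens_per_day(5):.1e} tokens/day")  # ~5.6e7, vs. 5.5e7 in the table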