The World’s Biggest AI Cluster Is Mostly Sitting Idle
AI compute efficiency has become the quiet fault line separating the companies that are winning the AI arms race from the ones that are merely spending the most money on it. Nowhere is that gap more visible right now than at xAI, Elon Musk’s artificial intelligence company, which operates what is believed to be the world’s largest single-site AI training cluster, and is currently using roughly 11% of it.
That figure came from an internal memo written by Michael Nicolls, the former SpaceX executive who was installed as xAI’s president earlier this year. In the memo, Nicolls described xAI as “clearly behind” in the AI competition and cited the company’s Model FLOPs Utilisation rate, or MFU, as the evidence. MFU measures how much of a system’s theoretical computing capacity is actually being used during AI training. An MFU of 11% means the chips in xAI’s Colossus supercomputer are delivering about a ninth of the computation they could in theory perform, the equivalent of roughly nine out of every ten chips sitting idle.
For a company that has staked its competitive positioning on having more raw compute than anyone else, this is an awkward number to have leaked.
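To make the MFU figure concrete, here is a minimal Python sketch of the ratio it describes: achieved training throughput divided by the theoretical peak of the hardware. The function names, the placeholder per-chip peak, and the ten-chip cluster are all assumptions made for illustration, not details from the memo. The 6-operations-per-parameter-per-token estimate is a widely used rule of thumb for dense transformer training, not something the reporting attributes to xAI.

```python
# Illustrative sketch only: every number below is hypothetical, not xAI's.

def training_flops_per_second(params: float, tokens_per_second: float) -> float:
    """Rule-of-thumb FLOP rate for training a dense transformer.

    Assumes roughly 6 floating-point operations per parameter per token
    (forward plus backward pass), ignoring attention-specific terms.
    """
    return 6.0 * params * tokens_per_second


def mfu(achieved_flops_per_second: float, num_chips: int,
        peak_flops_per_chip: float) -> float:
    """Model FLOPs Utilisation: achieved throughput over theoretical peak."""
    theoretical_peak = num_chips * peak_flops_per_chip
    return achieved_flops_per_second / theoretical_peak


# Hypothetical cluster: 10 chips, each with a placeholder peak of 1e15 FLOP/s,
# collectively sustaining 1.1e15 FLOP/s of useful training work.
example_mfu = mfu(achieved_flops_per_second=1.1e15, num_chips=10,
                  peak_flops_per_chip=1e15)
print(f"MFU: {example_mfu:.0%}")
```

In this toy case the ten chips together do the useful work of about one chip running at its theoretical peak, which is the sense in which an 11% MFU resembles nine idle chips out of ten.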
The Colossus Problem
xAI’s Colossus facility in Memphis, Tennessee, was built with remarkable speed. The company identified a vacant former Electrolux factory, repurposed it, and had the first 100,000 Nvidia GPUs operational within 122 days of breaking ground. By early 2026, the site had expanded to around 550,000 GPUs, a combination of H100 and H200 chips purchased for approximately $18 billion. A third building has since been acquired to expand the site toward two gigawatts of total power capacity, with ambitions for one million GPUs in total.
The construction story is genuinely impressive. The business story is more complicated.
According to reporting by The Information, industry benchmarks from Lambda AI put the normal range for AI compute efficiency in large-scale training between 35% and 45% MFU. Meta is operating at around 43%. Google sits at approximately 46%. xAI, with a larger cluster than either, is at 11%. A researcher at a competing lab described the figure to The Information as “ridiculously low.”
Nicolls set an internal target of 50% MFU in his memo, which would put xAI above its rivals if achieved. He did not specify a timeline.
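The gap between these figures is easier to see as arithmetic. The sketch below takes the roughly 550,000 GPUs reported for Colossus and asks how many chips running at theoretical peak would do the same work at each MFU level quoted above. "Peak-equivalent GPUs" is a label invented here for illustration, and the calculation assumes MFU applies uniformly across the whole cluster, which the reporting does not confirm.

```python
# Back-of-the-envelope comparison using figures quoted in the article.
# "Peak-equivalent GPUs" is an illustrative label, not an industry metric.

TOTAL_GPUS = 550_000  # approximate Colossus GPU count cited in the article

MFU_SCENARIOS = {
    "xAI (reported)": 0.11,
    "Meta": 0.43,
    "Google": 0.46,
    "xAI (target)": 0.50,
}


def peak_equivalent_gpus(total_gpus: int, mfu: float) -> int:
    """Number of GPUs at theoretical peak that would match this throughput."""
    return round(total_gpus * mfu)


equivalents = {
    name: peak_equivalent_gpus(TOTAL_GPUS, mfu)
    for name, mfu in MFU_SCENARIOS.items()
}

for name, gpus in equivalents.items():
    print(f"{name:>15}: about {gpus:,} peak-equivalent GPUs")

# Factor by which useful throughput would rise if the 50% target were reached
# on the same hardware.
target_speedup = MFU_SCENARIOS["xAI (target)"] / MFU_SCENARIOS["xAI (reported)"]
print(f"Reaching the target would multiply throughput by about {target_speedup:.1f}x")
```

On these assumptions, hitting the 50% target would raise useful training throughput by roughly four and a half times without adding a single chip.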
Why This Happens
Low GPU utilisation is not unique to xAI. AI compute efficiency is a structural problem across the industry, and one that becomes significantly harder to manage as clusters scale.
At smaller deployments of a few thousand chips, idle time is manageable. Training runs flat out, researchers analyse results, decisions get made about what to adjust, and training resumes. The gap between those phases is tolerable. But at the scale xAI is operating, with hundreds of thousands of chips distributed across a massive facility, even small inefficiencies in the software stack multiply fast. A weak link anywhere in the network connecting the GPUs can throttle the performance of the entire cluster. High-bandwidth memory struggles to keep pace with the compute cores. Checkpointing mechanisms that pause training to save progress against the risk of hardware failure add further idle time.
The software layer required to coordinate training across that many chips simultaneously is extraordinarily complex, and xAI’s distributed training stack has not yet matured to handle it. That is not a hardware problem. It is an engineering execution problem.
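A toy model shows why small inefficiencies compound with scale. It assumes synchronous training, where every step waits for the slowest chip, and treats each chip as having a small, independent chance of being slow on any given step. It also charges a fixed pause for each checkpoint. The probabilities, slowdown factor, and checkpoint timings are hypothetical values chosen for illustration, and the model is far simpler than any real training stack, xAI’s included.

```python
# Toy model of scale-related efficiency loss. All parameters are hypothetical.

def prob_step_delayed(num_chips: int, p_slow: float) -> float:
    """Chance that at least one chip is slow on a step.

    Assumes each chip is slow independently with probability p_slow and that
    a synchronous step cannot finish until every chip has finished.
    """
    return 1.0 - (1.0 - p_slow) ** num_chips


def expected_efficiency(num_chips: int, p_slow: float, slowdown_factor: float,
                        checkpoint_seconds: float,
                        checkpoint_interval_seconds: float) -> float:
    """Fraction of wall-clock time spent on undisturbed training work."""
    p_delay = prob_step_delayed(num_chips, p_slow)
    # Expected step duration relative to a step with no slow chip.
    relative_step_time = 1.0 + p_delay * (slowdown_factor - 1.0)
    straggler_efficiency = 1.0 / relative_step_time
    # Training pauses for checkpoint_seconds after every interval of work.
    checkpoint_efficiency = checkpoint_interval_seconds / (
        checkpoint_interval_seconds + checkpoint_seconds
    )
    return straggler_efficiency * checkpoint_efficiency


# Same per-chip behaviour, two very different cluster sizes.
settings = dict(p_slow=1e-5, slowdown_factor=2.0,
                checkpoint_seconds=60.0, checkpoint_interval_seconds=3600.0)
small_cluster = expected_efficiency(num_chips=1_000, **settings)
large_cluster = expected_efficiency(num_chips=500_000, **settings)

print(f"1,000 chips:   {small_cluster:.1%} of ideal throughput")
print(f"500,000 chips: {large_cluster:.1%} of ideal throughput")
```

With identical per-chip reliability, a slow chip delays about 1% of steps on the small cluster but more than 99% of steps on the large one, which is why a fault rate that is negligible at a few thousand chips becomes dominant at hundreds of thousands.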
There is also, according to sources cited in the original reporting, a cultural dimension to how utilisation figures get managed inside AI labs. Researchers at large facilities have been known to rerun completed training experiments specifically to keep GPU utilisation metrics elevated, partly to avoid criticism from management and partly to prevent idle chips from being reassigned to other teams. The number on paper and the number representing genuinely productive training are not always the same.
Renting Out the Problem

One of xAI’s responses to the utilisation gap has been to turn the excess capacity into a revenue stream. According to reporting by Business Insider, xAI is planning to supply computing power to Cursor, an AI coding startup currently valued at around $50 billion, allowing Cursor to train its upcoming model, Composer 2.5, on tens of thousands of xAI GPUs.
The arrangement effectively repositions xAI as a cloud infrastructure provider, following the model of Amazon Web Services, Microsoft Azure, and Google Cloud, as well as specialist GPU rental outfits like CoreWeave and Lambda. By renting idle capacity to external companies, xAI can generate revenue from hardware it is not fully using internally while also deepening ties with companies that have access to valuable training data.
The Cursor deal is not purely transactional. In March, xAI hired two of Cursor’s senior product engineering leads, Andrew Milich and Jason Ginsburg, who now oversee xAI’s product team and report directly to Musk and Nicolls. The personnel moves preceded the compute arrangement by several weeks, suggesting the relationship between the two companies is more intertwined than a simple cloud rental deal.
It is also worth noting what the Cursor arrangement implicitly signals. A company that genuinely needed all of its compute capacity for its own training would not be renting it out. Offering GPU access to a third party is a rational use of otherwise idle infrastructure, but it is also an admission that the infrastructure is, in fact, otherwise idle.
The Real Competition
The AI arms race has been publicly narrated as a competition for chips: who has the most Nvidia GPUs, who can build the biggest cluster, who can secure the next allocation before a rival does. Musk said during an all-hands meeting in December last year that xAI would beat OpenAI and Anthropic specifically because it would have access to more compute power to train its models. The logic was straightforward: more chips, better models, faster.
What the xAI utilisation story exposes is that AI compute efficiency is the harder and less glamorous problem: the distributed training frameworks, the memory bandwidth optimisation, the network fabric tuning, the fault tolerance engineering that keeps a cluster of hundreds of thousands of chips running as a coherent system rather than an expensive pile of silicon. Meta and Google have spent years building and refining those software stacks. They did not achieve 43% and 46% MFU by accident. They achieved it through the kind of unglamorous infrastructure engineering that does not generate headlines.
xAI built its hardware faster than anyone expected. The software that would allow it to actually use that hardware efficiently is still catching up.
What Comes Next
Nicolls has signalled that the path to 50% MFU runs through infrastructure and software stack optimisations, with the compute infrastructure team now led by SpaceX’s Daniel Dueri following a leadership reshuffle that also saw the previous infrastructure lead depart. Whether the target is achievable within months, as the memo implied, or whether it represents another round of aspirational goal-setting is an open question.
In the meantime, xAI is moving on multiple fronts. The TeraFab project, Musk’s in-house silicon initiative, aims to reduce dependence on Nvidia by developing proprietary chips for future training and inference workloads. xAI is also in discussions with investors including Saudi Arabia’s Public Investment Fund about raising a further $20 billion in capital, which would push its valuation toward $170 billion.
The ambition is unchanged. The question is whether the operational execution can catch up with the infrastructure build before the capital runs out or the competitive window closes.
Owning the world’s largest AI cluster matters less than it sounds if the cluster is mostly sitting idle. In the AI arms race, the most important resource turns out not to be the chips themselves, but the AI compute efficiency that determines whether those chips actually do anything useful.
Sources:
The Information. (2026, May 2). xAI shows how hard it is to use a lot of GPUs at once.
Techzine Global. (2025, December 31). xAI expands Colossus megadata center to 2 gigawatts.



