Exo turns a stack of Macs into one AI supercomputer — but should you actually use it?
A few weeks ago I wrote about running AI locally on a Mac Mini — one machine, one model, done. The constraint was always memory. A Mac Mini M4 Pro with 24GB can run a 14-billion-parameter model comfortably. Maybe a quantized 70B if you stretch to 48GB. But the really capable open-source models — DeepSeek V3 at 671 billion parameters, Llama 405B, Qwen3 235B — those need hundreds of gigabytes of memory. No single Mac can touch them.
Unless you connect several Macs together.
That’s what Exo does. It’s an open-source tool (Apache 2.0, 43,000+ GitHub stars) that pools multiple Apple Silicon devices into a single AI inference cluster. You run it on each machine, they discover each other automatically, and Exo splits the model across all of them. Your four Mac Minis become one 256GB-memory AI machine.
I spent a week digging into this — the official benchmarks, the architecture, and (more importantly) what people who actually use it every day say on Reddit. The gap between the marketing and the reality is instructive.
How it actually works
Exo uses two parallelism strategies depending on your hardware and network.
Pipeline parallelism splits the model into sequential slices. Device 1 runs layers 1-30, device 2 runs layers 31-60, and so on. Each device processes its layers and passes a tiny activation tensor (under 4KB for many models) to the next machine. This works over regular Ethernet or even Wi-Fi, though Wi-Fi performance is terrible — don’t bother.
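To make that concrete, here's a toy sketch in plain Python (nothing Exo-specific, just an illustration of the idea) of how 60 layers might be carved into contiguous slices across devices:

```python
# Toy illustration of pipeline parallelism, not Exo's actual code:
# each device owns a contiguous slice of layers, and activations flow
# from one slice to the next over the network.
def shard_layers(num_layers: int, num_devices: int) -> list[range]:
    """Split layer indices into roughly equal contiguous slices."""
    per_device, remainder = divmod(num_layers, num_devices)
    slices, start = [], 0
    for d in range(num_devices):
        end = start + per_device + (1 if d < remainder else 0)
        slices.append(range(start, end))
        start = end
    return slices

print(shard_layers(60, 2))  # [range(0, 30), range(30, 60)]
```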
Tensor parallelism is the newer, faster approach. Instead of splitting the model into sequential chunks, it splits individual tensor operations across devices. This requires much faster interconnects — specifically Thunderbolt 5 with RDMA (Remote Direct Memory Access). The payoff is significant: memory access latency drops from ~300 microseconds over TCP to under 50 microseconds with RDMA.
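Here's an equally hand-wavy sketch of tensor parallelism, using NumPy stand-ins for what MLX and RDMA do in practice. The split-and-merge pattern is the point, not the implementation:

```python
# Toy sketch: split one matmul's weight matrix column-wise across two
# "devices", compute partial results, then merge. Real systems do this
# with MLX and RDMA, not NumPy.
import numpy as np

x = np.random.randn(1, 4096)            # one token's activation
W = np.random.randn(4096, 8192)         # a single weight matrix

W0, W1 = np.split(W, 2, axis=1)         # each "device" holds half the columns
y0 = x @ W0                             # computed on device 0
y1 = x @ W1                             # computed on device 1

y = np.concatenate([y0, y1], axis=1)    # merge partial results over the interconnect
assert np.allclose(y, x @ W)            # identical to the unsplit matmul
```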
Both approaches use MLX under the hood — Apple’s machine learning library built specifically for Apple Silicon’s unified memory architecture. Exo handles the orchestration — device discovery, topology-aware sharding, load balancing — so you don’t have to manually configure which layers go where.
You start it with uv run exo and it serves an API on localhost:52415. That API is compatible with OpenAI’s chat completions format, Claude’s Messages API, and even Ollama’s API. So whatever tools you’re already using to talk to an AI model will probably work without changes.
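For example, something like this should work with the standard OpenAI Python client. I'm assuming the usual /v1 path prefix and using a placeholder model name; adjust both to match whatever your cluster is actually serving:

```python
# Hedged sketch: point an OpenAI-compatible client at the local Exo endpoint.
# The /v1 prefix and the model name are assumptions; check what your setup exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:52415/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.2-3b",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize this invoice in one sentence."}],
)
print(response.choices[0].message.content)
```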
The official benchmark numbers
Real benchmark data exists from multiple independent sources, and the numbers are genuinely impressive.
Exo’s own benchmarks (transparent, reproducible — they published the methodology) tested LLaMA 3.2 3B across M4 Pro machines:
- 1 device: 49.3 tokens/sec
- 2 devices: 44.4 tok/s single-request, 95.7 tok/s multi-request
- 3 devices: 39.7 tok/s single-request, 108.8 tok/s multi-request
Notice something? Single-request latency actually gets worse as you add devices. Each device hop adds network overhead. But total throughput scales well — 2.2x on three devices for parallel requests. If you’re running an AI agent that handles multiple tasks simultaneously, the throughput gain matters more than single-request speed.
Jeff Geerling’s Mac Studio cluster — four M3 Ultra Mac Studios with 1.5TB of combined memory, connected via Thunderbolt 5 RDMA — ran Qwen3 235B at 32 tokens/sec across the full cluster. For comparison, llama.cpp on the same hardware slowed down from 20.4 tok/s (one node) to 15.2 tok/s (four nodes). Exo with RDMA went the opposite direction: 19.5 tok/s (one node) up to 31.9 tok/s (four nodes).
That’s the headline that gets people excited. With the right interconnect, adding machines makes things faster, not slower.
DeepSeek R1 671B on two Mac Studios — independent benchmarker Ivan Fioravanti ran full 8-bit DeepSeek R1 across two M3 Ultra 512GB Mac Studios. The numbers were solid but showed a pattern: 18.6 tok/s generation at 442 tokens of context, dropping to 15.4 tok/s at 1,074 tokens, and cratering to 6.4 tok/s at 13,140 tokens. At 16K context, it ran out of memory entirely.
That context-length degradation is something the official benchmarks don’t emphasize. Short prompts look great. Long conversations slow to a crawl.
DeepSeek V3 671B on eight Mac Minis — Exo’s co-founder Alex Cheema demonstrated the 671B model across eight M4 Pro Mac Minis (64GB each, 512GB total). About 5 tokens/sec initially, with later optimizations pushing to 27.8 tok/s on two nodes and 32.5 tok/s on four.
Why does a 671B model sometimes run faster than Llama 70B on the same hardware? DeepSeek V3 uses a Mixture-of-Experts (MoE) architecture. It has 671 billion total parameters but only activates about 37 billion per token. Llama 70B uses all 70 billion every time. The “bigger” model does less work per token.
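A rough way to see it, counting parameters only (real speed also depends on memory bandwidth, quantization, and routing overhead):

```python
# Per-token compute scales with *active* parameters, not total parameters.
deepseek_total, deepseek_active = 671e9, 37e9   # MoE: only ~37B active per token
llama_dense = 70e9                              # dense: all 70B active every token

print(f"{deepseek_active / llama_dense:.2f}")   # ~0.53, roughly half the per-token work
```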
What Reddit actually says
I went through about a dozen threads on r/LocalLLaMA and r/LocalLLM. The vibe is… different from the benchmarks.
The most common sentiment: Exo works for demos. It does not work for daily use.
One user with dual M3 Ultra Mac Studios — about $40,000 in hardware — put it bluntly: “I could only get exo working occasionally and when it did it only worked well enough to perform demos and not real work. I couldn’t get 8-bit MLX DeepSeek R1 to run with mlx distributed or exo. None of these clustering solutions are fully baked.”
Another: “Worst clustering software ever. It’s fine as a proof of concept but you’ll get sick of it and quit using it in 10 minutes if you’re a normal user.”
And from a thread asking “Is anyone actually running an exo cluster?” — the top replies were variations of “I’ve experimented with it, but for day-to-day usage I rely on Ollama.”
Now, some of these complaints are from before the 1.0 release, which brought tensor parallelism and RDMA support. Things have improved. But even recent threads describe reliability issues, unexpected crashes, and setup headaches.
The criticism clusters around a few specific problems:
Platform support is Mac-only in practice. Exo technically runs on Linux, but CPU-only — no GPU acceleration. Windows doesn’t work at all (a dependency called uvloop doesn’t support it). The Nvidia DGX Spark demo is impressive, but regular Linux users with Nvidia GPUs can’t benefit yet.
You need a power-of-two node count. Tensor parallelism works across 2, 4, or 8 machines, but never three. As one Redditor noted: “which is why the YouTube videos NEVER demo 3 nodes.”
Long context kills performance. This showed up in multiple threads. Generation speed at short context lengths looks great; by 13K tokens it’s dropped 65%. If your use case involves processing long documents or extended conversations, this is a problem.
Setup is fragile. Multiple users report that getting Exo configured and stable requires significant technical effort, and it breaks between updates.
The honest cost math
A Reddit user broke down the cost-per-performance numbers:
- 4× Mac Studio M3 Ultra 512GB = ~$40K → ~25 tok/s on DeepSeek
- 8× Nvidia RTX PRO 6000 96GB = ~$64K → ~27 tok/s
- 8× Nvidia B100 192GB = ~$300K → ~300 tok/s
Their summary: “You pay $1,000 for each token/second.” Brutal, but honest.
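If you run the arithmetic on those three options, the hardware-only figure works out to roughly $1,000-2,400 per token/second of throughput, which is the spirit of that quote:

```python
# Hardware cost per token/second, from the figures above
# (ignores power, networking, and resale value).
options = {
    "4x Mac Studio M3 Ultra": (40_000, 25),
    "8x RTX PRO 6000":        (64_000, 27),
    "8x B100":                (300_000, 300),
}
for name, (price_usd, tok_per_s) in options.items():
    print(f"{name}: ~${price_usd / tok_per_s:,.0f} per tok/s")
```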
A more modest cluster of four Mac Mini M4 Pro machines with 64GB each:
- 4× Mac Mini M4 Pro 64GB: roughly EUR 7,000-8,000
- Thunderbolt 5 cables and a hub: EUR 200-400
- Electricity: ~200 watts total at full load, about EUR 15-20/month
That gives you 256GB of unified memory. Enough for models up to about 200 billion parameters at 4-bit quantization. For DeepSeek V3 671B, you’d need eight nodes (512GB), doubling the cost to around EUR 15,000.
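The back-of-envelope memory math, counting weights only (KV cache, activations, and runtime overhead are why the practical ceiling sits well below the raw number):

```python
# Weights-only memory estimate; real usage is meaningfully higher.
def weight_gb(params_billion: float, bits: int) -> float:
    # params_billion * 1e9 params * (bits / 8) bytes, expressed in GB
    return params_billion * bits / 8

print(weight_gb(200, 4))   # ~100 GB of weights for a 200B model at 4-bit
print(weight_gb(671, 4))   # ~336 GB for DeepSeek V3 671B at 4-bit
```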
There’s a counterargument that Redditors consistently make in favor of the Mac approach: resale value. One user: “A Mac Studio retains value incredibly well and can be used for all kinds of creative workflows, making it a much, much safer investment.” Buy four Mac Minis, the AI landscape shifts in two years, and you’ve still got four perfectly good computers. Try reselling an Nvidia A100 to a dental practice.
I covered the broader self-hosted vs. cloud cost analysis in a separate post, but the short version: for most small businesses spending EUR 50-150/month on cloud APIs, a single Mac Mini is the better local option. The cluster play only makes sense if you specifically need models that won’t fit in one machine’s memory.
The alternatives people actually use
Reddit surfaced some practical alternatives worth knowing about.
llama.cpp RPC mode. Multiple people called this out as simpler and more reliable for distributed inference. It works cross-platform (Windows, Linux, Mac), doesn’t require Thunderbolt, and one user described it as “actually quite easy — like super easy.” It lacks Exo’s automatic discovery and fancy UI, but it works.
ktransformers. For running DeepSeek specifically, some users reported an AMD EPYC CPU + RTX 5090 GPU setup running DeepSeek R1 with FP8 attention and Q4 experts at 14.5 tok/s for about $5K. Less elegant than a stack of Mac Minis, but cheaper and faster for that one model.
Just using one bigger machine. The unsexy answer. A single Mac Studio M3 Ultra with 192GB runs 70B+ models without any clustering overhead. No network latency, no node coordination, no powers-of-two restrictions. Several Redditors noted that if a model fits in one machine, Exo adds complexity for zero benefit.
What’s genuinely impressive (despite all that)
I don’t want to bury the achievement here. Some things are real.
RDMA over Thunderbolt 5. Getting remote memory access latency down from hundreds of microseconds over TCP to tens of microseconds, on consumer hardware connected by Thunderbolt cables, is serious engineering. It’s what separates Exo from earlier attempts at Mac clustering that just made things slower.
Automatic device discovery. Run exo on each machine and they find each other. No manual IP configuration, no config files. For a distributed system, that’s table stakes for usability, and Exo actually delivers it.
API compatibility. OpenAI, Claude, and Ollama APIs. Point your existing tools at localhost:52415 and they work.
The DGX Spark hybrid demo. Connecting Apple Silicon and Nvidia hardware in one inference cluster — different architectures, different vendors — with a 2.8x performance gain. That’s a preview of where distributed inference is heading, even if it’s not production-ready today.
MoE model performance. The fact that DeepSeek V3 (671B) generates tokens faster than Llama 70B on the same cluster is counterintuitive and genuinely useful. MoE models are going to keep getting bigger while staying fast, and Exo handles them particularly well.
Who should actually consider this
AI hobbyists and researchers who want to run frontier-scale models locally and don’t mind troubleshooting. If you’ve got multiple Macs and you want to experiment with 200B+ parameter models, Exo is the most accessible way to do it.
Businesses — not yet. I wouldn’t deploy this for a client right now. The software isn’t reliable enough for production use, the setup is fragile, and for most business automation tasks, a single Mac Mini with Ollama or a cloud API is simpler, cheaper, and more dependable. When Exo matures — stable releases, proper Linux GPU support, proven uptime — I’ll revisit this. The architecture is sound. The execution needs more time.
The exception: if you’re a business that specifically needs to run very large models locally for privacy reasons — 200B+ parameters, processing sensitive documents — and you have technical staff who can babysit the cluster, it might be worth evaluating. But go in with eyes open about the current limitations.
Where this is heading
Exo’s trajectory is clear. Version 1.0 brought RDMA and tensor parallelism. They’re working on Linux GPU acceleration. They’ve demonstrated mixed Apple/Nvidia clusters. The API compatibility means any tool in the local AI world can use it as a backend.
The broader trend matters more than any one tool: open-source models are getting better, consumer hardware is getting more capable, and the gap between “data center AI” and “under your desk AI” is shrinking. A year ago, running a 671-billion-parameter model required serious infrastructure. Now it requires eight Mac Minis and some patience.
Apple Silicon’s unified memory architecture — where the CPU, GPU, and neural engine all share the same memory pool — turns out to be weirdly well-suited for AI inference. It wasn’t designed for this. But at about $25 per gigabyte of high-bandwidth memory, nothing else in the consumer market comes close. That’s why everyone in r/LocalLLaMA is talking about Mac clusters and not, say, gaming PC clusters.
I think the most interesting angle for small businesses is the risk profile. Buy four Mac Minis, and if the AI landscape shifts, you’ve got four Macs. Your team can use them for other work, or you sell them for 60-70% of what you paid. That’s fundamentally different from buying specialized AI hardware that does one thing.
Exo isn’t ready for my clients today. But I’m watching it closely, and I’d bet on the approach — distributed inference on consumer hardware — becoming a real option within the next year or two.
If you’re curious whether local AI (single machine or clustered) makes sense for your business, let’s talk about it.
Book a free call. I'll tell you exactly what I'd automate first, what hardware you need, and what the whole thing costs. No surprises.