Salvatore Sanfilippo (creator of Redis) shipped ds4 (DwarfStar 4) in roughly one week. It is a single-file 18,404-line C inference engine, built only for DeepSeek V4 Flash, and as of late May 2026 it has 11,185 GitHub stars. The software wall around running a 284B-parameter frontier model locally is gone. The hardware wall is not. q2 needs 96-128 GB of unified memory, q4 needs 256 GB or more, and only Mac Studio M3 Ultra 512 GB really plays at the top end. This post breaks down the asymmetric IQ2_XXS quantization that makes ds4 work, the official Mac Metal benchmarks from the project README, a three-year buy-vs-rent TCO table in real dollars, and a seven-step setup that ends with Cursor talking to a DeepSeek V4 Flash instance you do not own.
The reception of ds4 has been unusually warm for a one-week side project. Towards AI tested it on 18 tasks and called the experience "a lot more B than A," where B is the frontier API and A is small local models. None of that matters if you cannot get the bytes onto Metal. Five gates explain why.
The honest summary: ds4 took down the software wall, but the hardware wall is taller and softer than it looks. You pay for it once at the register, then again every time the model under your engine becomes a generation old.
The defining bet in ds4 is single-model focus. llama.cpp targets hundreds of architectures. MLX is a general Apple Silicon ML stack. ds4 is a Metal graph executor written for the exact layer geometry of DeepSeek V4 Flash. Hand-tuned kernels, dedicated KV handling for routed MoE, on-disk KV protocol, prompt rendering, tool calling, plus an OpenAI- and Anthropic-compatible HTTP server in the same binary. The cost: when V4 Flash is superseded, much of the engine has to be rewritten. The benefit: today, on a 128 GB Mac, it is the shortest path to a working setup.
| Dimension | ds4 (antirez) | llama.cpp + generic GGUF | MLX (Apple) |
|---|---|---|---|
| Models supported | DeepSeek V4 Flash only, custom GGUFs | Hundreds of architectures | Most major models, Apple Silicon-first |
| Backend | Metal (primary) + CUDA (Linux); CPU only for correctness | CPU / CUDA / Metal / Vulkan | Apple Silicon Metal native |
| 2-bit quant strategy | Asymmetric: IQ2_XXS + Q2_K on routed experts; Q8/F16 elsewhere | Symmetric (IQ2 / IQ3 / Q4_K_M, etc.) | 4-bit / 8-bit general |
| On-disk KV | Native via --kv-disk-dir | External tooling required | None built in |
| 1M context | Native | Possible with tuning | Depends on model |
| OpenAI / Anthropic API | Built into ds4-server | Wraps via llama.cpp server or extra layer | Extra layer required |
| Build complexity | git clone && make | Pick backend, install deps | pip install |
| V4 Flash speed today | Among the fastest on Mac | Workable but not optimal | Not specifically tuned |
How to read this: pick ds4 if V4 Flash is the only model you care about and speed matters. Keep llama.cpp if you also rotate Llama, Qwen, Mistral, or Phi. MLX makes sense for research experiments inside the Apple ecosystem. Most teams I have seen run ds4 and llama.cpp on the same remote Mac and route requests by task.
The table below combines four facts on one row: model tier, minimum unified memory, the actual Mac it implies, US retail price band, and benchmark numbers from the ds4 README. All prefill and generation figures are from the official measurement table for MacBook Pro M3 Max 128 GB and Mac Studio M3 Ultra 512 GB. US prices reflect Apple online store as of May 2026, top-spec configurations.
| Tier | Min memory | Mac (top spec) | US retail (USD) | Prefill (official) | Generation (official) | Best fit |
|---|---|---|---|---|---|---|
| V4 Flash q2 | 96 GB (128 GB recommended) | MacBook Pro M3/M4/M5 Max 128 GB | $4,000-$5,000 | Short prompt 58.52 t/s; 11,709-token long prompt 250.11 t/s (M3 Max 128 GB) | 26.68 t/s short; 21.47 t/s long (M3 Max 128 GB) | Solo developer, coding agent, single-user inference |
| V4 Flash q4 | 256 GB or more | Mac Studio M3 Ultra 256 / 512 GB | $7,000-$14,500 | Short 78.95 t/s; 12,018-token long prompt 448.82 t/s (M3 Ultra 512 GB) | 35.50 t/s short; 26.62 t/s long (M3 Ultra 512 GB) | Quality ceiling, longer contexts, small team sharing |
| V4 Flash q2 (Ultra) | 128 GB (512 GB recommended) | Mac Studio M3 Ultra 512 GB | $14,500+ | Short 84.43 t/s; 11,709-token long prompt 468.03 t/s | 36.86 t/s short; 27.39 t/s long | Speed maximization, agent swarms, long sessions |
| V4 Pro q2 | 512 GB (effective) | Mac Studio M3 Ultra 512 GB top spec | $14,500+ | No reproducible community benchmarks (ds4 supports Flash only) | — | Not a ds4 path; multi-H100 / H200 rigs are more realistic |
Reading note: One Mac Studio M3 Ultra 512 GB can run either q2 (for speed) or q4 (for quality). The 256 GB Ultra cannot run q4 at top quality and cannot reach the 512 GB tier numbers. MacBook Pro caps at 128 GB; q4 and Pro are off the table on a laptop.
ds4 lists Metal as the primary target. That is not a marketing decision. It follows directly from the V4 Flash hardware envelope.
--kv-disk-dir protocol is designed to write KV slabs in patterns those drives handle well. On a typical desktop PC SATA SSD it would not be usable.Net: a high-memory Mac is, for now, the only consumer-grade hardware that lands in the "fits, runs fast enough, costs less than a car" intersection. That is also why this wave of local frontier-model inference is happening on Mac, not on x86.
The path below is the shortest validated route from zero to "Cursor sends a prompt and DeepSeek V4 Flash answers it." It works on a local 128 GB MacBook Pro, on a local 512 GB Mac Studio, and on a remote Mac over SSH. The only difference for remote is one extra port-forwarding line.
git clone https://github.com/antirez/ds4 && cd ds4 && make. Apple Silicon defaults to Metal. Do not run make cpu; the CPU path will crash macOS../download_model.sh q2 for 128 GB or ./download_model.sh q4 for 256 GB+. The script pulls from huggingface.co/antirez/deepseek-v4-gguf with curl -C - resume support. q2 is 80.8 GiB; q4 is 153.3 GiB../ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192. Do not jump straight to --ctx 1000000; that pulls memory you may not have and triggers swap. Start at 100K and grow.curl http://localhost:8080/v1/models. You should see deepseek-v4-flash. A POST /v1/chat/completions with a one-line prompt confirms generation works.http://<ds4-host>:8080/v1. Model: deepseek-v4-flash. API key: any string (ds4 does not validate by default).ssh -L 8080:localhost:8080 user@mac-host. Then point Cursor at http://localhost:8080/v1. The experience is indistinguishable from local. Full topology in the SSH local-forward to dedicated remote Mac runbook.git clone https://github.com/antirez/ds4 cd ds4 ./download_model.sh q2 # 96-128 GB Mac make # Metal default; never make cpu ./ds4-server \ --ctx 100000 \ --kv-disk-dir /tmp/ds4-kv \ --kv-disk-space-mb 8192 # From a laptop to a remote Mac running ds4-server: ssh -L 8080:localhost:8080 dev@mac-rental.example.com
Capex stories tend to flatter the buyer because they rarely model frontier-model iteration risk or SSD wear. The table below does. Buy side assumes US retail and a 50% three-year residual, which is generous for hardware that will be a generation behind by 2027. Cloud-rent side uses MACCOME's public rate card and assumes flexible monthly use, with hourly available for short experiments. Power, networking, replacement, and physical space are excluded from the buy side; including them widens the gap.
| Option | Up-front Capex (USD) | 3-year cost | Residual at year 3 (50%) | Net 3-year spend | Flexibility |
|---|---|---|---|---|---|
| Buy: Mac Studio M3 Ultra 256 GB | $7,000 | $7,000 (depreciation absorbed) | +$3,500 recovered | ~$3,500 (assuming you can sell it) | None: needs a new machine to move to q4 / Pro |
| Buy: Mac Studio M3 Ultra 512 GB top spec | $14,500 | $14,500 | +$7,250 recovered | ~$7,250 | None; biggest residual risk on iteration |
| Rent: MACCOME 128 GB monthly | $0 | 36 months at the public monthly rate | — | Typically 30-50% of buying top spec, lower if you stop and go | High: switch to 256 / 512 GB any month |
| Rent: MACCOME hourly (short experiments) | $0 | Pay only the hours you used | — | Very low for a short POC | Maximum: spin up, run, terminate |
Reframed in plain language: buying the 512 GB Ultra is paying $14,500 today plus three years of iteration risk plus SSD wear. Renting moves all three of those risks to the platform. The math only favors buying if you are sure you will saturate the box for 18+ months. The same logic shows up on a smaller dollar scale in the Mac mini M4 buy-vs-rent decision matrix, except here every variable is multiplied by ten and the residual risk is sharper.
antirez has demonstrated something specific. Running a 284B-parameter frontier model locally is now a software-solved problem. 18,404 lines of C, Metal-first, on-disk KV, OpenAI- and Anthropic-compatible endpoints, an integrated coding agent. The barrier left is the 96 to 512 GB hardware wall, and for individuals it is not just expensive: it is brittle.
If you actually account for the full picture, owning a top-spec Mac Studio Ultra to run ds4 has three quiet costs. (a) A $4,000 to $14,500 one-time hit that locks cash flow and depreciates fast as the next model lands. (b) Real SSD wear from running 1M-context sessions over years. (c) Zero elasticity: today you want q2, next quarter you want q4, but you still own only one machine. For anyone running a serious AI Agent stack or a small team that wants to share a single Ultra across experiments, MACCOME's high-memory remote Macs (128 / 256 / 512 GB tiers) on monthly or hourly rent are usually the cleaner answer. The Capex becomes Opex, the residual risk moves to the platform, and the same machine can be split among teammates.
To plan node selection by region, see the 2026 multi-region Mac node guide. To pair local inference with cloud-API routing, the 2026 OpenRouter rankings and routing matrix covers the "cloud API + local inference" layer that complements ds4. Read in that order, the picture is complete: where to host the model, and how to layer it with managed APIs.
FAQ
Can ds4 run DeepSeek V4 Pro instead of Flash?
No. ds4 only supports the DeepSeek V4 Flash GGUFs from antirez/deepseek-v4-gguf (q2, q4, MTP). It is not a generic GGUF loader. V4 Pro has 1.6T total parameters and 49B activated; even a q2 quant requires 512 GB-class unified memory or multi-H100 / H200 rigs. There is currently no reproducible community path for V4 Pro under ds4. If you need to evaluate Pro short term, consider trying a top-spec Ultra over a llama.cpp multi-GPU path on the order page.
Is q2 quantization quality noticeably worse?
q2 in ds4 is asymmetric. Only routed MoE experts are aggressively quantized (IQ2_XXS for gate/up, Q2_K for down). Shared experts, attention projections, router, embed, indexer keep Q8_0 / F16 / F32. Independent 18-task tests reported behavior close to frontier APIs for tool calling and code generation. Move to q4 (256 GB+) for high-precision math reasoning or multimodal tasks.
Will network latency hurt ds4 on a remote Mac when accessed via Cursor?
ds4-server speaks OpenAI- and Anthropic-compatible HTTP. Connect via SSH local forwarding or Tailscale. Same-region RTT is typically 5-30 ms, imperceptible during streaming. Cross-region adds about 150 ms first-token latency, but the streaming flow after that is essentially identical to local. See the support center for region pairing notes.
Why not just use llama.cpp or MLX?
llama.cpp and MLX are general-purpose runtimes covering hundreds of architectures. ds4 is a Metal graph executor written specifically for V4 Flash. Hand-tuned kernels, dedicated KV handling for routed MoE, on-disk KV protocol, prompt rendering, tool calling, all targeted at one model. The trade-off is single-model focus, but on a 128 GB Mac it is the shortest path to a working V4 Flash setup. Many teams install ds4 alongside llama.cpp on the same remote Mac and route per task.