DeepSeek V4 Flash Local on a Mac (2026): Inside antirez's ds4 Engine, the 96-512GB Hardware Wall, and a Buy-vs-Cloud-Rent Decision Table

17 min read · MACCOME

Salvatore Sanfilippo (creator of Redis) shipped ds4 (DwarfStar 4) in roughly one week. It is a single-file 18,404-line C inference engine, built only for DeepSeek V4 Flash, and as of late May 2026 it has 11,185 GitHub stars. The software wall around running a 284B-parameter frontier model locally is gone. The hardware wall is not. q2 needs 96-128 GB of unified memory, q4 needs 256 GB or more, and only Mac Studio M3 Ultra 512 GB really plays at the top end. This post breaks down the asymmetric IQ2_XXS quantization that makes ds4 work, the official Mac Metal benchmarks from the project README, a three-year buy-vs-rent TCO table in real dollars, and a seven-step setup that ends with Cursor talking to a DeepSeek V4 Flash instance you do not own.

Five hard gates that turn "good software" into "still cannot run it"

The reception of ds4 has been unusually warm for a one-week side project. Towards AI tested it on 18 tasks and called the experience "a lot more B than A," where B is the frontier API and A is small local models. None of that matters if you cannot get the bytes onto Metal. Five gates explain why.

  1. 96 GB unified memory is the floor, not the target. The q2 GGUF is 80.8 GiB on disk. After loading, you still need headroom for KV cache and runtime. The community floor is 96 GB; 128 GB is the first comfortable rung. A 64 GB machine will swap aggressively and become unusable.
  2. Metal-only path. CPU inference will crash the kernel. The README is direct: "current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code." That removes any non-Apple-Silicon Mac and any hope of CPU fallback.
  3. The 2-bit quant is unusual on purpose. Only routed MoE experts are aggressively quantized: IQ2_XXS for gate and up, Q2_K for down. Shared experts, attention projections, router, embed, indexer, and high-precision components stay at Q8_0, F16, or F32. The trade only works because antirez ships custom GGUFs in antirez/deepseek-v4-gguf. Generic V4 GGUFs from Hugging Face will not load.
  4. On-disk KV is a feature with a hidden bill. ds4 lets the KV cache spill to a directory you choose. That makes long contexts persistent across sessions, which is rare. It also pushes serious write traffic to the SSD. Apple internal NVMe has finite TBW; running 1M-token sessions 24/7 for three years is real wear, not a thought experiment.
  5. Capex is one-shot and the residual is brutal. A 128 GB MacBook Pro M3/M4/M5 Max is roughly $4,000-$5,000 in the US. A 256 GB Mac Studio M3 Ultra is around $7,000. The 512 GB top-spec Mac Studio crosses $14,000. Frontier models iterate every six to nine months. Resale value drops fast.

The honest summary: ds4 took down the software wall, but the hardware wall is taller and softer than it looks. You pay for it once at the register, then again every time the model under your engine becomes a generation old.

What ds4 actually optimizes (and why that is different from llama.cpp / MLX)

The defining bet in ds4 is single-model focus. llama.cpp targets hundreds of architectures. MLX is a general Apple Silicon ML stack. ds4 is a Metal graph executor written for the exact layer geometry of DeepSeek V4 Flash. Hand-tuned kernels, dedicated KV handling for routed MoE, on-disk KV protocol, prompt rendering, tool calling, plus an OpenAI- and Anthropic-compatible HTTP server in the same binary. The cost: when V4 Flash is superseded, much of the engine has to be rewritten. The benefit: today, on a 128 GB Mac, it is the shortest path to a working setup.

Dimension ds4 (antirez) llama.cpp + generic GGUF MLX (Apple)
Models supportedDeepSeek V4 Flash only, custom GGUFsHundreds of architecturesMost major models, Apple Silicon-first
BackendMetal (primary) + CUDA (Linux); CPU only for correctnessCPU / CUDA / Metal / VulkanApple Silicon Metal native
2-bit quant strategyAsymmetric: IQ2_XXS + Q2_K on routed experts; Q8/F16 elsewhereSymmetric (IQ2 / IQ3 / Q4_K_M, etc.)4-bit / 8-bit general
On-disk KVNative via --kv-disk-dirExternal tooling requiredNone built in
1M contextNativePossible with tuningDepends on model
OpenAI / Anthropic APIBuilt into ds4-serverWraps via llama.cpp server or extra layerExtra layer required
Build complexitygit clone && makePick backend, install depspip install
V4 Flash speed todayAmong the fastest on MacWorkable but not optimalNot specifically tuned

How to read this: pick ds4 if V4 Flash is the only model you care about and speed matters. Keep llama.cpp if you also rotate Llama, Qwen, Mistral, or Phi. MLX makes sense for research experiments inside the Apple ecosystem. Most teams I have seen run ds4 and llama.cpp on the same remote Mac and route requests by task.

Hardware bill: q2 / q4 / Pro mapped to real Macs

The table below combines four facts on one row: model tier, minimum unified memory, the actual Mac it implies, US retail price band, and benchmark numbers from the ds4 README. All prefill and generation figures are from the official measurement table for MacBook Pro M3 Max 128 GB and Mac Studio M3 Ultra 512 GB. US prices reflect Apple online store as of May 2026, top-spec configurations.

Tier Min memory Mac (top spec) US retail (USD) Prefill (official) Generation (official) Best fit
V4 Flash q2 96 GB (128 GB recommended) MacBook Pro M3/M4/M5 Max 128 GB $4,000-$5,000 Short prompt 58.52 t/s; 11,709-token long prompt 250.11 t/s (M3 Max 128 GB) 26.68 t/s short; 21.47 t/s long (M3 Max 128 GB) Solo developer, coding agent, single-user inference
V4 Flash q4 256 GB or more Mac Studio M3 Ultra 256 / 512 GB $7,000-$14,500 Short 78.95 t/s; 12,018-token long prompt 448.82 t/s (M3 Ultra 512 GB) 35.50 t/s short; 26.62 t/s long (M3 Ultra 512 GB) Quality ceiling, longer contexts, small team sharing
V4 Flash q2 (Ultra) 128 GB (512 GB recommended) Mac Studio M3 Ultra 512 GB $14,500+ Short 84.43 t/s; 11,709-token long prompt 468.03 t/s 36.86 t/s short; 27.39 t/s long Speed maximization, agent swarms, long sessions
V4 Pro q2 512 GB (effective) Mac Studio M3 Ultra 512 GB top spec $14,500+ No reproducible community benchmarks (ds4 supports Flash only) Not a ds4 path; multi-H100 / H200 rigs are more realistic
warning

Reading note: One Mac Studio M3 Ultra 512 GB can run either q2 (for speed) or q4 (for quality). The 256 GB Ultra cannot run q4 at top quality and cannot reach the 512 GB tier numbers. MacBook Pro caps at 128 GB; q4 and Pro are off the table on a laptop.

Why Apple Silicon is the right host (and why x86 + discrete GPU is not)

ds4 lists Metal as the primary target. That is not a marketing decision. It follows directly from the V4 Flash hardware envelope.

  • Unified Memory Architecture removes the VRAM ceiling. On x86 with discrete GPUs, VRAM size is the constraint. To fit an 80-150 GB model you need an H200 (141 GB HBM3e) at five figures, or four RTX 4090s with INT4 and multi-GPU plumbing. Apple Silicon shares one large pool between CPU and GPU. A 128 GB M3 Max holds the 80.8 GB q2 GGUF, KV cache, and runtime without any PCIe round trips.
  • Memory bandwidth fits the MoE pattern. V4 Flash is sparse MoE: 284B total parameters, 13B activated per token. Each token must read its routed expert weights from memory. M3 Ultra delivers around 800 GB/s of memory bandwidth. Combined with IQ2_XXS shrinking the routed experts, it stays inside the bandwidth envelope.
  • Internal NVMe matches the on-disk KV protocol. Apple Silicon Macs ship NVMe SSDs in the 5-6 GB/s sequential class. The --kv-disk-dir protocol is designed to write KV slabs in patterns those drives handle well. On a typical desktop PC SATA SSD it would not be usable.

Net: a high-memory Mac is, for now, the only consumer-grade hardware that lands in the "fits, runs fast enough, costs less than a car" intersection. That is also why this wave of local frontier-model inference is happening on Mac, not on x86.

Seven steps from git clone to Cursor talking to V4 Flash

The path below is the shortest validated route from zero to "Cursor sends a prompt and DeepSeek V4 Flash answers it." It works on a local 128 GB MacBook Pro, on a local 512 GB Mac Studio, and on a remote Mac over SSH. The only difference for remote is one extra port-forwarding line.

  1. Pick the tier. 96-128 GB Mac → q2. 256 GB+ → q4 (or q2 for speed). MacBook Pro caps at 128 GB so always q2.
  2. Clone and build. git clone https://github.com/antirez/ds4 && cd ds4 && make. Apple Silicon defaults to Metal. Do not run make cpu; the CPU path will crash macOS.
  3. Download the model. ./download_model.sh q2 for 128 GB or ./download_model.sh q4 for 256 GB+. The script pulls from huggingface.co/antirez/deepseek-v4-gguf with curl -C - resume support. q2 is 80.8 GiB; q4 is 153.3 GiB.
  4. Start the server. ./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192. Do not jump straight to --ctx 1000000; that pulls memory you may not have and triggers swap. Start at 100K and grow.
  5. Verify. In another terminal: curl http://localhost:8080/v1/models. You should see deepseek-v4-flash. A POST /v1/chat/completions with a one-line prompt confirms generation works.
  6. Connect Cursor or opencode. In Cursor's model settings, add a custom OpenAI-compatible endpoint. URL: http://<ds4-host>:8080/v1. Model: deepseek-v4-flash. API key: any string (ds4 does not validate by default).
  7. Remote Mac via SSH local forwarding. On your laptop: ssh -L 8080:localhost:8080 user@mac-host. Then point Cursor at http://localhost:8080/v1. The experience is indistinguishable from local. Full topology in the SSH local-forward to dedicated remote Mac runbook.
bash
git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2          # 96-128 GB Mac
make                             # Metal default; never make cpu

./ds4-server \
  --ctx 100000 \
  --kv-disk-dir /tmp/ds4-kv \
  --kv-disk-space-mb 8192

# From a laptop to a remote Mac running ds4-server:
ssh -L 8080:localhost:8080 dev@mac-rental.example.com

Three citable data points (from the ds4 README and Hugging Face)

  • Model spec. DeepSeek V4 Flash: 284B total parameters, 13B activated, 1M token context length, FP4 + FP8 mixed in the original release. This is the largest open-weight model that can plausibly run on a single high-memory Mac in 2026.
  • Measured throughput. MacBook Pro M3 Max 128 GB at q2 with an 11,709-token long prompt: prefill 250.11 t/s, generation 21.47 t/s. Mac Studio M3 Ultra 512 GB at q2, same prompt: prefill 468.03 t/s, generation 27.39 t/s. q4 on the Ultra at 12,018 tokens: prefill 448.82 t/s, generation 26.62 t/s.
  • GGUF on disk. q2 weighs 80.8 GiB (IQ2_XXS gate/up, Q2_K down on routed experts; Q8 on attention/shared/output; F16 on router/embed/indexer/HC/compressor; F32 on norms/sinks/bias). q4 is 153.3 GiB with Q4_K on routed experts and the rest unchanged. Optional MTP file for speculative decoding: 3.6 GiB.

Buy vs rent: a three-year TCO that pays attention to residuals

Capex stories tend to flatter the buyer because they rarely model frontier-model iteration risk or SSD wear. The table below does. Buy side assumes US retail and a 50% three-year residual, which is generous for hardware that will be a generation behind by 2027. Cloud-rent side uses MACCOME's public rate card and assumes flexible monthly use, with hourly available for short experiments. Power, networking, replacement, and physical space are excluded from the buy side; including them widens the gap.

Option Up-front Capex (USD) 3-year cost Residual at year 3 (50%) Net 3-year spend Flexibility
Buy: Mac Studio M3 Ultra 256 GB $7,000 $7,000 (depreciation absorbed) +$3,500 recovered ~$3,500 (assuming you can sell it) None: needs a new machine to move to q4 / Pro
Buy: Mac Studio M3 Ultra 512 GB top spec $14,500 $14,500 +$7,250 recovered ~$7,250 None; biggest residual risk on iteration
Rent: MACCOME 128 GB monthly $0 36 months at the public monthly rate Typically 30-50% of buying top spec, lower if you stop and go High: switch to 256 / 512 GB any month
Rent: MACCOME hourly (short experiments) $0 Pay only the hours you used Very low for a short POC Maximum: spin up, run, terminate

Reframed in plain language: buying the 512 GB Ultra is paying $14,500 today plus three years of iteration risk plus SSD wear. Renting moves all three of those risks to the platform. The math only favors buying if you are sure you will saturate the box for 18+ months. The same logic shows up on a smaller dollar scale in the Mac mini M4 buy-vs-rent decision matrix, except here every variable is multiplied by ten and the residual risk is sharper.

Closing the loop: you do not need to spend $14k to run V4 Flash

antirez has demonstrated something specific. Running a 284B-parameter frontier model locally is now a software-solved problem. 18,404 lines of C, Metal-first, on-disk KV, OpenAI- and Anthropic-compatible endpoints, an integrated coding agent. The barrier left is the 96 to 512 GB hardware wall, and for individuals it is not just expensive: it is brittle.

If you actually account for the full picture, owning a top-spec Mac Studio Ultra to run ds4 has three quiet costs. (a) A $4,000 to $14,500 one-time hit that locks cash flow and depreciates fast as the next model lands. (b) Real SSD wear from running 1M-context sessions over years. (c) Zero elasticity: today you want q2, next quarter you want q4, but you still own only one machine. For anyone running a serious AI Agent stack or a small team that wants to share a single Ultra across experiments, MACCOME's high-memory remote Macs (128 / 256 / 512 GB tiers) on monthly or hourly rent are usually the cleaner answer. The Capex becomes Opex, the residual risk moves to the platform, and the same machine can be split among teammates.

To plan node selection by region, see the 2026 multi-region Mac node guide. To pair local inference with cloud-API routing, the 2026 OpenRouter rankings and routing matrix covers the "cloud API + local inference" layer that complements ds4. Read in that order, the picture is complete: where to host the model, and how to layer it with managed APIs.

FAQ

Can ds4 run DeepSeek V4 Pro instead of Flash?

No. ds4 only supports the DeepSeek V4 Flash GGUFs from antirez/deepseek-v4-gguf (q2, q4, MTP). It is not a generic GGUF loader. V4 Pro has 1.6T total parameters and 49B activated; even a q2 quant requires 512 GB-class unified memory or multi-H100 / H200 rigs. There is currently no reproducible community path for V4 Pro under ds4. If you need to evaluate Pro short term, consider trying a top-spec Ultra over a llama.cpp multi-GPU path on the order page.

Is q2 quantization quality noticeably worse?

q2 in ds4 is asymmetric. Only routed MoE experts are aggressively quantized (IQ2_XXS for gate/up, Q2_K for down). Shared experts, attention projections, router, embed, indexer keep Q8_0 / F16 / F32. Independent 18-task tests reported behavior close to frontier APIs for tool calling and code generation. Move to q4 (256 GB+) for high-precision math reasoning or multimodal tasks.

Will network latency hurt ds4 on a remote Mac when accessed via Cursor?

ds4-server speaks OpenAI- and Anthropic-compatible HTTP. Connect via SSH local forwarding or Tailscale. Same-region RTT is typically 5-30 ms, imperceptible during streaming. Cross-region adds about 150 ms first-token latency, but the streaming flow after that is essentially identical to local. See the support center for region pairing notes.

Why not just use llama.cpp or MLX?

llama.cpp and MLX are general-purpose runtimes covering hundreds of architectures. ds4 is a Metal graph executor written specifically for V4 Flash. Hand-tuned kernels, dedicated KV handling for routed MoE, on-disk KV protocol, prompt rendering, tool calling, all targeted at one model. The trade-off is single-model focus, but on a 128 GB Mac it is the shortest path to a working V4 Flash setup. Many teams install ds4 alongside llama.cpp on the same remote Mac and route per task.