2026 OpenClaw with Ollama/vLLM on a Remote Mac: Ports, Health Checks, Resource Contention, and Start/Stop Runbook

~16 min read · MACCOME

On a leased or self-managed remote Mac, running OpenClaw Gateway on the same machine as Ollama or local vLLM rarely fails because "the model string is wrong". It fails when ports, probe order, unified memory and CPU contention, and start/stop hygiene stack together. This article splits work with the offline model triage guide: that post owns API bridges, context limits, and no-reply flows; this one is the same-box topology runbook. You should leave knowing how to separate listeners, which layer to verify before healthz, and why you downgrade the model before you thrash Gateway when everything is pegged.

Five "false failures" you see when everything shares one Mac

  1. Port collisions: Ollama defaults to 11434, common vLLM to 8000, OpenClaw Control UI often 18789; another local reverse proxy or sidecar is the usual double-bind (see the port-ownership check after this list).
  2. Probes in the wrong order: Gateway returns 200 while the provider still points at a cold or broken inference endpoint, so the UI loads but the chat prints nothing.
  3. Unified memory monopolized: model weights plus Metal or ANE plus Node buffers can make the box "feel slow" with misleadingly calm CPU charts.
  4. Shared I/O and thermals: long runs pin fans and thermal limits; Xcode-sized builds and a 24/7 agent on one host show up as jitter.
  5. Upgrade races: bump Gateway before the model, or the reverse, and your saved base URL, token, and real listeners diverge. Docker-specific steps live in the official Docker and Control UI guide.
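
Before you file that ticket, prove who actually owns each port. A minimal check with `lsof` (which ships with macOS), assuming the default ports listed above; adjust to your real config:

```bash
# Who is listening on the usual suspects? (-n/-P skip DNS and port-name lookups)
for port in 11434 8000 18789; do
  echo "== port ${port} =="
  lsof -nP -iTCP:${port} -sTCP:LISTEN || echo "nothing is listening on ${port}"
done
```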

Ports and ownership: paste this table into the change ticket

Put the table in the review so "we all meant port 8000" does not turn into "something else took 8000 in week three".

| Component | Typical default | Touchpoint with OpenClaw |
| --- | --- | --- |
| Ollama | HTTP on 127.0.0.1:11434 by default; open the LAN only with an explicit bind and a firewall story | Provider baseURL to the OpenAI-compatible surface; avoid blind reverse proxies with no upstream health |
| vLLM (local) | Often 8000 or custom; multiple instances need disjoint ports and GPU or thread pools | Same as Ollama: prove /v1/models and a minimal completion before Gateway references it |
| OpenClaw Gateway | Control UI often 18789; follow your real openclaw config | healthz / readyz first, provider second; see Gateway and model triage |
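
Before Gateway references either provider, prove the OpenAI-compatible surface directly. A minimal sketch against Ollama's endpoint; the model name `llama3.1` is a placeholder for whatever you actually pulled:

```bash
# List the models the provider actually serves
curl -sS "http://127.0.0.1:11434/v1/models"

# One tiny completion: if this fails, no Gateway setting will fix it
curl -sS "http://127.0.0.1:11434/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'
```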

Six steps: an order you can sign

  1. Write a resource budget line: max memory and concurrent inference for the model process, plus headroom for Gateway and Node; on M4-class hosts keep roughly 10–20% of RAM free for the OS and disk cache (tune to your model).
  2. Start inference, then wire the provider: wait for Ollama or vLLM to listen, run a minimal chat/completions with curl, then start or reload Gateway so "not ready" never becomes sticky state (scripted in the sketch after this list).
  3. Lock ports and loopback: if only local callers exist, bind to 127.0.0.1; if a container must reach the host, document bridge rules and who owns the firewall.
  4. Two-layer probes: layer one is inference health (HTTP plus a tiny generation), layer two is Gateway healthz, layer three is an end-to-end chat probe; if any layer is red, keep production traffic away.
  5. Hot-change discipline: when upgrading weights or images, drain conversations, stop Gateway ingress or go read-only if you have that, swap the model, bring Gateway back, and rerun both probe layers (sketched after the probe block below).
  6. Rollback: keep the previous weight path and the previous provider block; when several things move, roll back inference first, then Gateway. For doctor semantics across platforms, see post-install doctor notes for the sections that apply to macOS hosts.
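
Step 2 is mechanical enough to script. A minimal sketch, assuming Ollama on its default port; the Gateway start command is a placeholder, so substitute whatever your install documents:

```bash
# Step 2 in script form: inference first, provider second
until curl -sS --max-time 2 "http://127.0.0.1:11434/api/tags" > /dev/null; do
  echo "waiting for Ollama to listen..."; sleep 2
done
# Rerun the minimal chat completion from the table section here,
# then (and only then) start or reload Gateway:
# openclaw gateway start   # hypothetical command; use what your install documents
```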
```bash
# Minimal probe order (rewrite host/port)
curl -sS "http://127.0.0.1:11434/api/tags" > /dev/null    # Ollama alive
# curl -sS "http://127.0.0.1:8000/v1/models" > /dev/null  # vLLM
curl -fsS "http://127.0.0.1:18789/healthz"                # Gateway
# Then one short chat completion or openclaw doctor, whatever your install documents
```
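
And step 5 as a checklist in shell form. The service-management commands are placeholders, assuming Gateway runs under launchd or similar; the ordering, not the command names, is the point:

```bash
# Hot-change discipline, in order (service commands are placeholders for your setup)
# 1. Drain: stop routing new conversations to this host
# 2. Stop Gateway ingress, or flip to read-only if supported:
# launchctl stop com.example.openclaw-gateway
# 3. Swap the model while nothing references it
ollama pull llama3.1:latest        # or replace the weight path for vLLM
# 4. Bring Gateway back and rerun BOTH probe layers:
# launchctl start com.example.openclaw-gateway
curl -sS "http://127.0.0.1:11434/api/tags" > /dev/null && \
curl -fsS "http://127.0.0.1:18789/healthz"
```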

Three numbers that belong in the on-call guide (replace with measured values)

  • Time-to-first-token and queue depth: log P95 after a cold load and when hot; if hot latency is still several seconds while CPU looks idle, check unified memory pressure and paging before raising concurrency (a measurement sketch follows this list).
  • Dual load on one Mac: when large Xcode or monorepo builds fight a 24/7 agent, you usually see OOM or glacial token rates; time-slice, queue, or move builds to a dedicated builder before you chase model names.
  • Keep-alives and timeouts: for long streams, a too-tight timeout on either hop (Gateway toward inference, or clients toward Gateway) causes mid-stream drops; change both sides together and record the change ID.
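
Time-to-first-token is cheap to approximate from the shell with curl's timing variables: `%{time_starttransfer}` is the time until the first response byte, which on a streamed completion is close to first-token latency. A minimal sketch, model name again a placeholder:

```bash
# Approximate time-to-first-token via time-to-first-byte on a streamed completion
curl -sS -o /dev/null -w "ttfb: %{time_starttransfer}s  total: %{time_total}s\n" \
  "http://127.0.0.1:11434/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "stream": true, "messages": [{"role": "user", "content": "ping"}]}'
```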

Why ad-hoc laptops and unmanaged shared hosts lose this topology

Sleep/wake cycles, residential uplinks, and unpredictable neighbors turn reproducible start/stop and probing into a game of chance. Co-hosted Ollama plus Gateway needs a stable thermal envelope and predictable I/O. Moving the agent and inference onto a dedicated, 24/7, contract-backed memory and disk profile often beats endless tuning. For production-grade runbooks, MACCOME cloud Macs pair dedicated Apple Silicon with lease models you can put in a ledger, so you fight over resource tables and change control, not luck.

What not to do in week one after go-live

Do not hammer Gateway concurrency before you have a known-good single completion on the provider. Do not bump Gateway before you confirm nothing else owns the inference port. Two clean cuts beat ten pages of tribal knowledge.

FAQ

How does this split work with the offline Ollama/vLLM article?

That article covers API bridging, context, and no-reply triage. This one covers same-machine ports, probes, resources, and start/stop. Pair offline private model triage with Gateway triage.

Gateway in Docker with Ollama on the host: is that "co-hosted"?

Same logical machine, but the networking goes beyond the port list and must be explicit: host.docker.internal or bridge IPs, plus verifiable firewall rules. Start from Docker production deployment and the official Docker guide; a minimal sketch follows.
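
A sketch of the container-to-host hop, assuming Docker Desktop on macOS (where host.docker.internal resolves automatically); the image name and the OPENCLAW_PROVIDER_BASE_URL variable are hypothetical, so check your image's docs for the real ones:

```bash
# Container Gateway talking to host Ollama; image and env var are placeholders
docker run -d --name openclaw-gateway \
  -p 18789:18789 \
  -e OPENCLAW_PROVIDER_BASE_URL="http://host.docker.internal:11434/v1" \
  openclaw/gateway:latest   # hypothetical image name

# Verify from inside the container that the host provider is reachable
docker exec openclaw-gateway \
  curl -sS "http://host.docker.internal:11434/api/tags"
```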