Teams that already run OpenClaw in 2026, whether from a local install or Docker/Compose, rarely fail at “cannot install.” They fail on wrong model routes, mixed 429s and timeouts, inconsistent failover order, and split-brain env vars between the npm global install and containers. This article is scoped apart from the cross-platform install, Docker production, and upgrade & migration guides: it focuses on runtime multi-model routing, an executable failover order, dual-path tables, and symptom-based Gateway/CLI log triage. For post-install symptoms, continue with doctor triage.
When default and fallback models, each with different provider rate limits, sit behind one Gateway, failures look random. Map these six classes to alert fields; do not stop at the HTTP status alone.
export without container injection, or compose overrides opposite to intent.
These pains are orthogonal to upgrade backups and image tags: this is runtime routing, that is change control. Read both guides to separate release work from pager duty.
Multi-model usually means multiple billing accounts and compliance boundaries. If sessions are not explicitly scoped to models, you risk overspend or policy violations—treat the route table as a cost and permissions contract reviewed with Secrets governance.
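One way to make that contract enforceable is to check each session's scope against the route table before any model call. The sketch below is illustrative only: the `Route` fields, the scope names, and the model and account identifiers are hypothetical placeholders, not OpenClaw configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_id: str
    provider: str
    billing_account: str
    allowed_scopes: frozenset  # session scopes permitted to use this model


# Hypothetical route table; every identifier here is a placeholder.
ROUTES = {
    "default-model": Route("default-model", "provider-a", "acct-prod",
                           frozenset({"prod", "batch"})),
    "fallback-model": Route("fallback-model", "provider-b", "acct-prod",
                            frozenset({"prod"})),
}


def resolve_route(session_scope: str, model_id: str, routes=ROUTES) -> Route:
    """Return the route only if the session scope is explicitly allowed."""
    route = routes.get(model_id)
    if route is None:
        raise KeyError(f"model {model_id!r} is not in the route table")
    if session_scope not in route.allowed_scopes:
        raise PermissionError(
            f"scope {session_scope!r} may not use {model_id!r} "
            f"(billed to {route.billing_account})"
        )
    return route
```

Failing loudly on an unscoped session keeps overspend and policy violations visible at review time instead of on the invoice.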
“Endpoint reachable” is not “chain healthy”: proxies, firewalls, and DNS can make success vary from one session to the next, so structured logs and sampling tell you more than a single global error rate.
Document config load order, env precedence, and restart boundaries for both paths or you will see “host changed, container did not.”
| Dimension | npm global / local process | Docker / Compose |
|---|---|---|
| Config & secrets | User config files and shell env dominate | env_file, mounts, runtime -e must be explicit |
| Upgrade & rollback | npm package pins with global CLI | Image tags, volumes, docker compose pull order per upgrade guide |
| Health checks | Align with systemd/launchd probes | In-container curl/CLI; network stack differs from host (incl. loopback policy) |
| Common mistakes | Multiple Node versions pick the wrong global | Read-only mounts expected to hot reload; env lost after rebuild |
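
To catch “host changed, container did not” early, it can help to compute the effective settings for each path and diff them. The following is a minimal sketch that assumes later layers override earlier ones; the layer order and the variable names (`DEFAULT_MODEL`, `FALLBACK_MODEL`) are hypothetical, so substitute the precedence you actually documented for each path.

```python
def effective_config(layers):
    """Merge config layers in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


def diff_paths(host, container, keys):
    """Return {key: (host_value, container_value)} for keys that disagree."""
    return {
        key: (host.get(key), container.get(key))
        for key in keys
        if host.get(key) != container.get(key)
    }


# Hypothetical layers: config file first, then shell env / compose env.
host = effective_config([
    {"DEFAULT_MODEL": "model-a", "FALLBACK_MODEL": "model-b"},  # user config file
    {"DEFAULT_MODEL": "model-c"},                               # shell export
])
container = effective_config([
    {"DEFAULT_MODEL": "model-a", "FALLBACK_MODEL": "model-b"},  # mounted config
    {},                                                         # env_file lacks the override
])

drift = diff_paths(host, container, ["DEFAULT_MODEL", "FALLBACK_MODEL"])
print(drift)
```

In this example only `DEFAULT_MODEL` drifts, because the shell export never reached the container.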
Set org-wide rules for when to swap the model, the key, or the egress path, and write them into the same SLO doc. In the example order column, steps are attempted from left to right.
| Symptom (logs/metrics) | Likely cause | Example order |
|---|---|---|
| HTTP 429 or explicit rate limit | Quota or concurrency | Backoff → spare key → lower concurrency → temporary fallback model |
| Timeouts, resets, slow TLS | Network path or region egress | Increase timeout (capped) → proxy/DNS → closer egress |
| Model missing / not entitled | ID or account permission | Check provider console → fix route table → avoid silent unrelated fallback |
| Partial session success | Key imbalance or sticky routing errors | Per-key counters & circuit break → session pinning → Gateway sharding |
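
The table above can be encoded as data so that the failover order is executable and reviewable. This is a sketch under stated assumptions: the status-code and error-text heuristics in `classify` are illustrative guesses, not provider or OpenClaw behavior, and the step names are hypothetical labels for your own runbook actions. Partial session success cannot be detected from a single request, so it appears in the order table but not in the classifier.

```python
from enum import Enum


class Symptom(Enum):
    RATE_LIMIT = "rate_limit"
    NETWORK = "network"
    MODEL_UNAVAILABLE = "model_unavailable"
    PARTIAL_SUCCESS = "partial_success"
    UNKNOWN = "unknown"


# Ordered steps per symptom, mirroring the table (earlier entries first).
FAILOVER_ORDER = {
    Symptom.RATE_LIMIT: ["backoff", "spare_key", "lower_concurrency", "fallback_model"],
    Symptom.NETWORK: ["increase_timeout_capped", "check_proxy_dns", "closer_egress"],
    Symptom.MODEL_UNAVAILABLE: ["check_provider_console", "fix_route_table"],
    Symptom.PARTIAL_SUCCESS: ["per_key_circuit_break", "session_pinning", "gateway_sharding"],
}


def classify(status, error_text=""):
    """Map one failed request to a symptom class (heuristic, assumed rules)."""
    text = error_text.lower()
    if status == 429 or "rate limit" in text:
        return Symptom.RATE_LIMIT
    if status is None or status in (408, 504) or "timeout" in text or "reset" in text:
        return Symptom.NETWORK
    if status in (403, 404) and "model" in text:
        return Symptom.MODEL_UNAVAILABLE
    return Symptom.UNKNOWN


def next_step(symptom, attempt):
    """Return the step for a zero-based attempt, or None when exhausted."""
    steps = FAILOVER_ORDER.get(symptom, [])
    return steps[attempt] if attempt < len(steps) else None
```

Returning `None` when the list is exhausted, rather than falling through to another model, keeps the “avoid silent unrelated fallback” rule explicit.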
# Minimum log fields per request (example):
# requestId / sessionId / provider / modelId / status / latencyMs
# If any is missing, add observability before changing routes blindly
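A small check can flag log records that lack any of these fields before anyone starts changing routes. The sketch assumes logs are emitted as one JSON object per line, which is an assumption about your logging setup rather than a documented OpenClaw format.

```python
import json

REQUIRED_FIELDS = ("requestId", "sessionId", "provider", "modelId", "status", "latencyMs")


def missing_fields(log_line):
    """Return the required fields that are absent or empty in one JSON log line."""
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return list(REQUIRED_FIELDS)  # unparseable line: treat every field as missing
    if not isinstance(record, dict):
        return list(REQUIRED_FIELDS)
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


line = '{"requestId": "r1", "sessionId": "s1", "provider": "p", "status": 429}'
print(missing_fields(line))
```

Running this over a sample of recent Gateway logs gives a quick ratio of records that are usable for triage.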
Warning: When downgrading to a smaller or cheaper model, label capability gaps in downstream automation or review steps—silent “dumber” outputs cause business incidents.
In 2026, provider catalogs still churn, so config-as-documentation beats tribal knowledge. Store route tables and alert thresholds in the same repo to reduce handoff gaps.
If the Gateway runs in both APAC and North America, build a region × provider heatmap: regional degradation often precedes a global outage and can signal when to rent burst capacity.
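The grid behind such a heatmap is just an error rate per (region, provider) pair. The sketch below assumes each log record carries a `region` field in addition to the minimum fields, and that a missing status or any status of 400 and above counts as an error; both are assumptions to adapt.

```python
from collections import defaultdict


def error_rate_grid(records):
    """Return {(region, provider): error_rate} from per-request log records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        key = (record["region"], record["provider"])
        totals[key] += 1
        status = record.get("status")
        # Assumed rule: no status (e.g. timeout) or status >= 400 is an error.
        if status is None or status >= 400:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


sample = [
    {"region": "apac", "provider": "provider-a", "status": 200},
    {"region": "apac", "provider": "provider-a", "status": 429},
    {"region": "na", "provider": "provider-a", "status": 200},
    {"region": "na", "provider": "provider-b", "status": None},
]
grid = error_rate_grid(sample)
```

Feeding `grid` into any plotting tool gives the heatmap; a single hot cell points at one region and provider rather than a global incident.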
Decompose each user journey: auth → routing → model call → tool side effects → session writeback. Each stage should share a requestId; if not, add tracing before tuning models.
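Whether every stage really shares a requestId can be checked from trace events. The sketch assumes each event is a dict with `requestId` and `stage` keys and uses hypothetical stage names derived from the journey above; requests that legitimately call no tools will report the tool stage as missing, so filter accordingly.

```python
STAGES = ("auth", "routing", "model_call", "tool_side_effects", "session_writeback")


def missing_stages(events):
    """Return {requestId: [stages never seen]} for requests with gaps."""
    seen = {}
    for event in events:
        seen.setdefault(event["requestId"], set()).add(event["stage"])
    return {
        request_id: [stage for stage in STAGES if stage not in stages]
        for request_id, stages in seen.items()
        if not set(STAGES) <= stages
    }


events = [{"requestId": "r1", "stage": stage} for stage in STAGES]
events += [
    {"requestId": "r2", "stage": "auth"},
    {"requestId": "r2", "stage": "model_call"},
]
gaps = missing_stages(events)
```

A request that shows `model_call` but no `routing` event, as `r2` does here, usually means a stage is logging without the shared identifier.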
For hybrid setups (laptop, bare server, container), run a weekly minimal parity test: same prompt and route version on all three paths; freeze releases if latency/error spread crosses threshold.
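The freeze decision from that parity test can be reduced to a spread check across the three paths. This is a minimal sketch: the metric names (`p50_latency_ms`, `error_rate`) and the threshold values are hypothetical and should come from your own SLO doc.

```python
def should_freeze(results, max_latency_spread_ms, max_error_spread):
    """Return True if latency or error-rate spread across paths exceeds a threshold."""
    latencies = [path["p50_latency_ms"] for path in results.values()]
    error_rates = [path["error_rate"] for path in results.values()]
    latency_spread = max(latencies) - min(latencies)
    error_spread = max(error_rates) - min(error_rates)
    return latency_spread > max_latency_spread_ms or error_spread > max_error_spread


# Same prompt and route version on all three paths (values are made up).
weekly = {
    "laptop": {"p50_latency_ms": 900, "error_rate": 0.0},
    "bare_server": {"p50_latency_ms": 700, "error_rate": 0.0},
    "container": {"p50_latency_ms": 1600, "error_rate": 0.25},
}
freeze = should_freeze(weekly, max_latency_spread_ms=500, max_error_spread=0.1)
```

Comparing spread rather than absolute values keeps the test sensitive to one path drifting while the others stay healthy.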
Personal devices add sleep, flaky WAN, and unaudited env vars that turn routing bugs into intermittent ghosts. When CI, paging, or customer SLAs bind, you need dedicated compute, stable egress, and contractable rental terms—not endless hosts file edits.
For 24/7 Gateway, batch automation, or lower latency next to build/signing hosts, placing execution on professional multi-region Mac cloud is usually easier to observe and audit. MACCOME offers Mac Mini M4 / M4 Pro bare-metal across regions with flexible terms—pair with the multi-region guide and rental rates.
Pilot in one region until routes and log fields are stable, then decide whether to co-locate Gateway with workloads to avoid cross-region inference plus throttling.
If you also use advanced channels from the advanced runbook, ship model routing changes separately from channel config changes to limit blast radius; attach the route table version to the change ticket for log sampling and audits.
FAQ
How is this different from the upgrade and migration guide?
Upgrades cover backups and rollback; this covers runtime routing and dual-path logs. For triage see doctor triage; commercial terms in rental rates.
Docker shows a new model name but traffic is old—what first?
Check compose volumes and env overrides, then container-loaded config and Gateway logs; pair with Docker production health checks.
How to plan OpenClaw with a dedicated remote Mac?
Review SSH/VNC and placement together: SSH vs VNC and the Help Center.