2026 OpenClaw Gateway Health Checks & Rolling Updates
Docker Compose / Kubernetes Probes & Zero-Downtime Runbook

About 20 min read · MACCOME

Teams running OpenClaw Gateway on Docker or Kubernetes in 2026 often ship fast yet still treat “container running” as healthy. Without HTTP probe paths, readiness semantics, and rolling-update parameters on the same change ticket, you get liveness kills during cold start, depends_on that waits for containers to start but not to become ready, or a provider 429 mistaken for a dead Gateway and endless restarts. This article builds on the Docker production runbook, the upgrade and migration checklist, and the provider routing and failover guide, and delivers six RCA-ready pitfalls, a liveness/readiness/startup matrix, a Compose-versus-Kubernetes mapping, copy-paste YAML snippets, a six-step rollout runbook, and three dashboard metrics—plus how to place Gateway on a stable remote Mac execution plane.

From “port open” to correct semantics: six probe pitfalls

Recent OpenClaw releases add orchestration-friendly HTTP endpoints (exact paths and ports follow your pinned image tag and release notes; names such as /health, /ready, and /healthz appear in the ecosystem). Log these six patterns in RCAs and reuse the vocabulary from the doctor and post-install triage articles.

  1. Liveness fires during “process up, business not ready”: cold start loads routing tables or local state—a too-early 200 admits traffic, while a too-early failure triggers kubelet kills.
  2. Readiness coupled to external models: provider throttling may shed load cluster-wide; confirm that matches your SLO rather than signaling a global outage.
  3. Compose depends_on without a health condition: dependents start while Gateway still cannot reach a backend socket—intermittent 502s.
  4. localhost probes differ from Service paths: 127.0.0.1 works inside the Pod while the ClusterIP path fails—easily misread as an app failure.
  5. Aggressive maxUnavailable: old Pods drain before new Pods pass readiness—short full-red windows.
  6. Log triage that mixes layers: TLS termination, proxy timeouts, and Gateway errors get merged—and probes get tightened blindly.
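
Pitfall 3 has a direct fix in Compose v2: gate dependents on `condition: service_healthy` rather than container start. A minimal sketch; the service names, image tags, and `GATEWAY_PORT` variable here are illustrative, and the health path must match the docs for your pinned tag:

```yaml
# Sketch only; service names, images, and the /health path are assumptions.
services:
  gateway:
    image: openclaw/gateway:2026.3.1   # illustrative; pin your real tag or digest
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://127.0.0.1:${GATEWAY_PORT}/health || exit 1"]
      interval: 15s
      timeout: 3s
      retries: 5
      start_period: 120s
  agent:
    image: openclaw/agent:2026.3.1     # illustrative dependent service
    depends_on:
      gateway:
        condition: service_healthy     # wait for healthy, not merely started
```

With plain `depends_on: [gateway]`, the agent starts as soon as the Gateway container exists; the long-form condition makes Compose wait for the healthcheck to pass first.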

How this fits the series: the cross-platform install guide answers first boot; the production runbook answers long-lived ops; this article answers how orchestrators decide what counts as healthy; the upgrade checklist answers image moves and rollback.

Table 1: liveness, readiness, and startup probes—how to split responsibilities

Kubernetes probe types do not map 1:1 to Docker healthcheck restart semantics; use the table in architecture reviews.

| Check | Typical failure effect | Validates | OpenClaw-oriented guidance |
| --- | --- | --- | --- |
| startupProbe | Suppresses liveness failures until first success | Slow but bounded cold start | Use when first config fetch, indexes, or dependencies take minutes |
| livenessProbe | Restarts the container/Pod | Deadlocks, unresponsive process | Avoid external LLM dependencies; minimal self-check only |
| readinessProbe | Removes the Pod from Service endpoints | Not ready for traffic | May include a minimal model ping or config-loaded signal; align with failover policy |
| Docker healthcheck | Marks unhealthy; restart policy varies | Single-host Compose health | Pair with depends_on: condition: service_healthy (syntax per Compose v2 docs) |

Table 2: Expressing the same health requirement in Compose versus Kubernetes

Translating “healthy” into concrete fields cuts midnight debate.

| Dimension | Docker Compose (pattern) | Kubernetes Deployment |
| --- | --- | --- |
| Probe command | healthcheck.test with curl/wget | httpGet or exec |
| Startup grace | start_period | startupProbe or a larger initialDelaySeconds |
| Traffic shedding | Proxy/LB layer or health label only | readinessProbe controls Endpoints |
| Rolling | Manual compose ordering or external CD | maxSurge / maxUnavailable / minReadySeconds |
```yaml
# Examples—replace PORT and paths with values from docs for your tag
# Docker Compose (excerpt)
healthcheck:
  test: ["CMD-SHELL", "curl -fsS http://127.0.0.1:${GATEWAY_PORT}/health || exit 1"]
  interval: 15s
  timeout: 3s
  retries: 5
  start_period: 120s

# Kubernetes (excerpt)
readinessProbe:
  httpGet:
    path: /ready
    port: http
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 30
  periodSeconds: 20
```
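
When cold start genuinely takes minutes (Table 1's startupProbe row), a startup probe keeps liveness from killing the Pod early. A sketch, assuming the same /health path and the named `http` port as above; the threshold and period are illustrative, not official defaults:

```yaml
# Kubernetes (excerpt); values are illustrative, confirm paths for your tag
startupProbe:
  httpGet:
    path: /health
    port: http
  failureThreshold: 30   # 30 * 10s allows up to 5 minutes of cold start
  periodSeconds: 10      # liveness/readiness only take over after first success
```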

Warning: Upstream may add or rename /health, /ready, /healthz across 2026.3.x-style releases. Before copying snippets, confirm official docs for your digest/tag and verify with curl -v in staging.
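
To make that staging verification repeatable, a small poll-until-healthy helper can wrap the manual curl. A sketch; the URL, attempt count, and delay are assumptions to adjust per environment:

```shell
#!/bin/sh
# Poll a health URL until it returns success or attempts run out.
probe_until_healthy() {
  url="$1"; attempts="${2:-10}"; delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP >= 400, so non-2xx/3xx counts as unhealthy
    code=$(curl -fsS -o /dev/null -w '%{http_code}' "$url" 2>/dev/null) && {
      echo "healthy after $i attempt(s): HTTP $code"
      return 0
    }
    echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gateway not healthy after $attempts attempts" >&2
  return 1
}

# Example: probe_until_healthy "http://127.0.0.1:${GATEWAY_PORT}/health" 20 5
```

Run it inside the container first, then against the Service, to separate pitfall 4 (localhost versus ClusterIP) from a real application failure.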

Six-step rollout runbook: from curl 200 to a rollback-friendly rolling update

  1. Pin the version anchor: record image digest or minor tag plus documented health paths and ports.
  2. Validate inside the container: probe 127.0.0.1 via docker compose exec or kubectl exec, then validate via Service.
  3. Readiness before liveness: stabilize readiness and startup, then tighten liveness to avoid startup kills.
  4. Tune rolling parameters: keep at least one serving replica at all times; document maxUnavailable versus the maintenance window.
  5. Align provider failover: document expected behavior on external model failure—shed, degrade model, or alert only—per the provider article.
  6. Practice rollback: follow the upgrade checklist to retag images and restore volumes; confirm probes still match the older build.
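
Step 4 can be pinned in the Deployment spec itself rather than tribal knowledge. A sketch with illustrative numbers; tune maxUnavailable against your maintenance window and replica count:

```yaml
# Deployment spec (excerpt); numbers are illustrative, not recommendations
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # bring one new Pod up before draining anything
    maxUnavailable: 0    # never drop below the desired serving count
minReadySeconds: 30      # a Pod must stay Ready this long before it counts
```

With maxUnavailable: 0 and a working readinessProbe, the short full-red windows from pitfall 5 cannot occur: old Pods drain only after replacements pass readiness and hold it for minReadySeconds.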

Three hard metrics for dashboards and on-call

  1. Probe triplet: HTTP code, latency percentile, consecutive failures—show beside Ingress 502 rate.
  2. Ready replicas / desired during deploys: alert on how long readiness stays below 100%; frequent small releases compress that window.
  3. External dependency error share: 429/5xx from providers versus Gateway-internal errors—share fields with the provider routing article.
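
Metric 3 is just a ratio over bucketed status codes. A minimal sketch of the computation; the split of which codes count as provider-side versus Gateway-internal is an assumption to align with your provider routing fields:

```python
from collections import Counter

# Assumed buckets: which codes are provider-side is a policy choice, not a standard.
PROVIDER_CODES = {429, 502, 503, 504}

def external_error_share(status_codes):
    """Fraction of observed errors (HTTP >= 400) attributable to external providers.

    `status_codes` is an iterable of HTTP codes seen at the Gateway;
    2xx/3xx responses are ignored. Returns 0.0 when there are no errors.
    """
    counts = Counter(c for c in status_codes if c >= 400)
    provider = sum(n for c, n in counts.items() if c in PROVIDER_CODES)
    total = sum(counts.values())
    return provider / total if total else 0.0

# Example: three provider 429s and one internal 500 -> 0.75
print(external_error_share([200, 429, 429, 429, 500]))
```

A high share points at backoff and routing policy (the provider article), not at restarting the Gateway.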

On the Linux systemd + Tunnel path, align tunnel health, loopback listeners, and upstream LB checks—otherwise you can see “tunnel alive, Gateway not listening” false positives.

Correlate kubectl rollout status or compose upgrade logs with Git changes to separate tight probes from image regressions.

Why consumer NAS or spare laptops struggle with production-grade probe semantics

Consumer gear fights sleep, disk jitter, and unscheduled OS updates—startup time and probe thresholds drift. Combined with rolling windows, that burns on-call hours. Running OpenClaw and agents under an expected SLA needs dedicated compute, stable egress, and burst-friendly nodes.

Fragmented self-hosting also makes multi-region latency and contract ops harder: probe tuning plus host reboot coupling is painful on laptops. For 24/7 observable, rollable, rollback-friendly Gateways, professional multi-region Apple Silicon cloud Macs usually beat ad-hoc hardware. MACCOME offers Mac Mini M4 / M4 Pro bare-metal with flexible terms as a Gateway or mixed automation host—start with the help center for access language, then rental rates and the multi-region guide to finalize SKUs.

Pilot: short-term rent in your target region, run container probes, Service probes, and one full rolling exercise before locking monthly or quarterly terms.

FAQ

Probes fail but the UI opens—which source wins?

Orchestrator-configured URLs and status codes. For billing context open rental rates; for probes reproduce with in-container curl in staging.

How do I use this with the Docker production article?

Production covers volumes and tokens; this covers probes and rollouts. Attach both plus upgrades to the same change.

Should 429 hit liveness?

Generally no—use provider routing and failover for backoff and routing; readiness coupling is an explicit SLO choice.