2026 OpenClaw Gateway Health Checks & Rolling Updates
Docker Compose / Kubernetes Probes & Zero-Downtime Runbook

About 20 min read · MACCOME

Teams running OpenClaw Gateway on Docker or Kubernetes in 2026 often ship fast yet still treat “container running” as healthy. Without HTTP probe paths, readiness semantics, and rolling-update parameters on the same change ticket, you get liveness kills during cold start, depends_on that waits for containers to start but not to become ready, or a provider 429 mistaken for a dead Gateway and endless restarts. This article builds on the Docker production runbook, the upgrade and migration checklist, and the provider routing and failover guide, and delivers six RCA-ready pitfalls, a liveness/readiness/startup matrix, a Compose-versus-Kubernetes mapping, copy-paste YAML snippets, a six-step rollout runbook, and three dashboard metrics—plus how to place Gateway on a stable remote Mac execution plane.

From “port open” to correct semantics: six probe pitfalls

Recent OpenClaw releases add orchestration-friendly HTTP endpoints (exact paths and ports follow your pinned image tag and release notes; names such as /health, /ready, and /healthz appear in the ecosystem). Log these six patterns in RCAs and reuse the vocabulary from the doctor and post-install triage articles.

  1. Liveness fires during “process up, business not ready”: cold start loads routing tables or local state—a too-early 200 admits traffic, while a too-early failure triggers kubelet kills.
  2. Readiness coupled to external models: provider throttling may shed load cluster-wide; confirm that matches your SLO rather than signaling a global outage.
  3. Compose depends_on without a health condition: dependents start while Gateway still cannot reach a backend socket—intermittent 502s.
  4. localhost probes differ from Service paths: 127.0.0.1 works inside the Pod while the ClusterIP path fails—easily misread as an app failure.
  5. Aggressive maxUnavailable: old Pods drain before new Pods pass readiness—short full-red windows.
  6. Log triage that mixes layers: TLS termination, proxy timeouts, and Gateway errors get merged—and probes get tightened blindly.
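
Pitfall 3 has a direct fix in Compose v2: gate dependents on `condition: service_healthy` rather than container start. A minimal sketch; the service names, image tags, and `GATEWAY_PORT` variable here are illustrative, and the health path must match the docs for your pinned tag:

```yaml
# Sketch only; service names, images, and the /health path are assumptions.
services:
  gateway:
    image: openclaw/gateway:2026.3.1   # illustrative; pin your real tag or digest
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://127.0.0.1:${GATEWAY_PORT}/health || exit 1"]
      interval: 15s
      timeout: 3s
      retries: 5
      start_period: 120s
  agent:
    image: openclaw/agent:2026.3.1     # illustrative dependent service
    depends_on:
      gateway:
        condition: service_healthy     # wait for healthy, not merely started
```

With plain `depends_on: [gateway]`, the agent starts as soon as the Gateway container exists; the long-form condition makes Compose wait for the healthcheck to pass first.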

How this fits the series: the cross-platform install guide answers first boot; the production runbook answers long-lived ops; this article answers how orchestrators decide what counts as healthy; the upgrade checklist answers image moves and rollback.

Table 1: liveness, readiness, and startup probes—how to split responsibilities

Kubernetes probe types do not map 1:1 to Docker healthcheck restart semantics; use the table in architecture reviews.

| Check | Typical failure effect | Validates | OpenClaw-oriented guidance |
| --- | --- | --- | --- |
| startupProbe | Suppresses liveness failures until first success | Slow but bounded cold start | Use when first config fetch, indexes, or dependencies take minutes |
| livenessProbe | Restarts the container/Pod | Deadlocks, unresponsive process | Avoid external LLM dependencies; minimal self-check only |
| readinessProbe | Removes the Pod from Service endpoints | Not ready for traffic | May include a minimal model ping or config-loaded signal; align with failover policy |
| Docker healthcheck | Marks unhealthy; restart policy varies | Single-host Compose health | Pair with depends_on: condition: service_healthy (syntax per Compose v2 docs) |

Table 2: Expressing the same health requirement in Compose versus Kubernetes

Translating “healthy” into concrete fields cuts midnight debate.

| Dimension | Docker Compose (pattern) | Kubernetes Deployment |
| --- | --- | --- |
| Probe command | healthcheck.test with curl/wget | httpGet or exec |
| Startup grace | start_period | startupProbe or a larger initialDelaySeconds |
| Traffic shedding | Proxy/LB layer or health label only | readinessProbe controls Endpoints |
| Rolling | Manual compose ordering or external CD | maxSurge / maxUnavailable / minReadySeconds |
```yaml
# Examples—replace PORT and paths with values from docs for your tag
# Docker Compose (excerpt)
healthcheck:
  test: ["CMD-SHELL", "curl -fsS http://127.0.0.1:${GATEWAY_PORT}/health || exit 1"]
  interval: 15s
  timeout: 3s
  retries: 5
  start_period: 120s

# Kubernetes (excerpt)
readinessProbe:
  httpGet:
    path: /ready
    port: http
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 30
  periodSeconds: 20
```
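
When cold start genuinely takes minutes (Table 1's startupProbe row), a startup probe keeps liveness from killing the Pod early. A sketch, assuming the same /health path and the named `http` port as above; the threshold and period are illustrative, not official defaults:

```yaml
# Kubernetes (excerpt); values are illustrative, confirm paths for your tag
startupProbe:
  httpGet:
    path: /health
    port: http
  failureThreshold: 30   # 30 * 10s allows up to 5 minutes of cold start
  periodSeconds: 10      # liveness/readiness only take over after first success
```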

Warning: Upstream may add or rename /health, /ready, /healthz across 2026.3.x-style releases. Before copying snippets, confirm official docs for your digest/tag and verify with curl -v in staging.
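
To make that staging verification repeatable, a small poll-until-healthy helper can wrap the manual curl. A sketch; the URL, attempt count, and delay are assumptions to adjust per environment:

```shell
#!/bin/sh
# Poll a health URL until it returns success or attempts run out.
probe_until_healthy() {
  url="$1"; attempts="${2:-10}"; delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP >= 400, so non-2xx/3xx counts as unhealthy
    code=$(curl -fsS -o /dev/null -w '%{http_code}' "$url" 2>/dev/null) && {
      echo "healthy after $i attempt(s): HTTP $code"
      return 0
    }
    echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gateway not healthy after $attempts attempts" >&2
  return 1
}

# Example: probe_until_healthy "http://127.0.0.1:${GATEWAY_PORT}/health" 20 5
```

Run it inside the container first, then against the Service, to separate pitfall 4 (localhost versus ClusterIP) from a real application failure.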

Six-step rollout runbook: from curl 200 to a rollback-friendly rolling update

  1. Pin the version anchor: record image digest or minor tag plus documented health paths and ports.
  2. Validate inside the container: probe 127.0.0.1 via docker compose exec or kubectl exec, then validate via Service.
  3. Readiness before liveness: stabilize readiness and startup, then tighten liveness to avoid startup kills.
  4. Tune rolling parameters: keep at least one serving replica at all times; document maxUnavailable versus the maintenance window.
  5. Align provider failover: document expected behavior on external model failure—shed, degrade model, or alert only—per the provider article.
  6. Practice rollback: follow the upgrade checklist to retag images and restore volumes; confirm probes still match the older build.
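
Step 4 can be pinned in the Deployment spec itself rather than tribal knowledge. A sketch with illustrative numbers; tune maxUnavailable against your maintenance window and replica count:

```yaml
# Deployment spec (excerpt); numbers are illustrative, not recommendations
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # bring one new Pod up before draining anything
    maxUnavailable: 0    # never drop below the desired serving count
minReadySeconds: 30      # a Pod must stay Ready this long before it counts
```

With maxUnavailable: 0 and a working readinessProbe, the short full-red windows from pitfall 5 cannot occur: old Pods drain only after replacements pass readiness and hold it for minReadySeconds.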

Three hard metrics for dashboards and on-call

  1. Probe triplet: HTTP code, latency percentile, consecutive failures—show beside Ingress 502 rate.
  2. Ready replicas / desired during deploys: alert on how long readiness stays below 100%; frequent small releases compress that window.
  3. External dependency error share: 429/5xx from providers versus Gateway-internal errors—share fields with the provider routing article.
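
Metric 3 is just a ratio over bucketed status codes. A minimal sketch of the computation; the split of which codes count as provider-side versus Gateway-internal is an assumption to align with your provider routing fields:

```python
from collections import Counter

# Assumed buckets: which codes are provider-side is a policy choice, not a standard.
PROVIDER_CODES = {429, 502, 503, 504}

def external_error_share(status_codes):
    """Fraction of observed errors (HTTP >= 400) attributable to external providers.

    `status_codes` is an iterable of HTTP codes seen at the Gateway;
    2xx/3xx responses are ignored. Returns 0.0 when there are no errors.
    """
    counts = Counter(c for c in status_codes if c >= 400)
    provider = sum(n for c, n in counts.items() if c in PROVIDER_CODES)
    total = sum(counts.values())
    return provider / total if total else 0.0

# Example: three provider 429s and one internal 500 -> 0.75
print(external_error_share([200, 429, 429, 429, 500]))
```

A high share points at backoff and routing policy (the provider article), not at restarting the Gateway.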

On the Linux systemd + Tunnel path, align tunnel health, loopback listeners, and upstream LB checks—otherwise you can see “tunnel alive, Gateway not listening” false positives.

Correlate kubectl rollout status or compose upgrade logs with Git changes to separate tight probes from image regressions.

Why consumer NAS or spare laptops struggle with production-grade probe semantics

Consumer gear fights sleep, disk jitter, and unscheduled OS updates—startup time and probe thresholds drift. Combined with rolling windows, that burns on-call hours. Running OpenClaw and agents under an expected SLA needs dedicated compute, stable egress, and burst-friendly nodes.

Fragmented self-hosting also makes multi-region latency and contract ops harder: probe tuning plus host reboot coupling is painful on laptops. For 24/7 observable, rollable, rollback-friendly Gateways, professional multi-region Apple Silicon cloud Macs usually beat ad-hoc hardware. MACCOME offers Mac Mini M4 / M4 Pro bare-metal with flexible terms as a Gateway or mixed automation host—start with the help center for access language, then rental rates and the multi-region guide to finalize SKUs.

Pilot: short-term rent in your target region, run container probes, Service probes, and one full rolling exercise before locking monthly or quarterly terms.

FAQ

Probes fail but the UI opens—which source wins?

Orchestrator-configured URLs and status codes. For billing context open rental rates; for probes reproduce with in-container curl in staging.

How do I use this with the Docker production article?

Production covers volumes and tokens; this covers probes and rollouts. Attach both plus upgrades to the same change.

Should 429 hit liveness?

Generally no—use provider routing and failover for backoff and routing; readiness coupling is an explicit SLO choice.