Audience: operators whose Gateway runs but whose MCP tools never appear, whose calls time out, or whose Skills vanish after restart. Outcome: bootstrap stays in the install guide and the Docker production runbook, and persistence stays in the volumes & Skills permissions article. This runbook covers declare → process visibility → Gateway registration → model/tool/channel triage. Layout: six pitfalls, two matrices, config sketch, six steps, three KPIs, closing guidance.
MCP is a JSON-RPC session between Gateway and a child process or remote endpoint. A config entry existing does not mean the child starts, and a started child does not mean tool schemas are returned. Six frequent misreads follow.
... PATH or API keys from ~/.zshrc.
AGENTS.md / bootstrap text: duplicate instructions across MCP and Skills inflate context; split boundaries per the Skills tuning checklist.

Run openclaw doctor using the order in the post-install doctor guide; this article adds the tool-registration evidence chain, not another install tutorial.
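The evidence chain (child starts, child answers the handshake, child returns schemas) can be exercised outside Gateway. The sketch below is one way to do that under stated assumptions: it speaks the MCP stdio transport (one JSON-RPC message per line) directly to a child process and reports the furthest stage reached. The function name `probe_mcp_stdio`, the stage labels, and the default protocol version string are illustrative choices, not part of OpenClaw; adjust the version to whatever your server implements.

```python
import json
import queue
import subprocess
import threading
import time


def probe_mcp_stdio(command, timeout=5.0, protocol_version="2024-11-05"):
    """Return (stage, tool_names) for the furthest stage the child reached.

    Stages: spawn_failed, no_initialize_reply, no_tools_reply, tools_listed.
    """
    try:
        child = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL, text=True, bufsize=1,
        )
    except OSError:
        return "spawn_failed", []  # missing binary, bad permissions, ...

    replies = queue.Queue()

    def pump():
        # The stdio transport carries one JSON message per line.
        for line in child.stdout:
            try:
                replies.put(json.loads(line))
            except json.JSONDecodeError:
                pass  # stray non-JSON output on stdout

    threading.Thread(target=pump, daemon=True).start()

    def send(message):
        child.stdin.write(json.dumps(message) + "\n")
        child.stdin.flush()

    def wait_for(request_id):
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            try:
                message = replies.get(timeout=remaining)
            except queue.Empty:
                return None
            if isinstance(message, dict) and message.get("id") == request_id:
                return message

    stage = "no_initialize_reply"
    try:
        send({"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {
            "protocolVersion": protocol_version,
            "capabilities": {},
            "clientInfo": {"name": "runbook-probe", "version": "0.0.0"},
        }})
        reply = wait_for(1)
        if reply is None or "result" not in reply:
            return stage, []
        stage = "no_tools_reply"
        send({"jsonrpc": "2.0", "method": "notifications/initialized"})
        send({"jsonrpc": "2.0", "id": 2, "method": "tools/list"})
        reply = wait_for(2)
        if reply is None or "result" not in reply:
            return stage, []
        return "tools_listed", [t["name"] for t in reply["result"].get("tools", [])]
    except OSError:
        return stage, []  # child closed its stdin, usually because it exited
    finally:
        child.kill()
        child.wait()
```

Run it as the same user and with the same environment as Gateway; a probe that succeeds from an interactive shell but fails under the service account points at PATH or environment differences rather than at the server itself.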
Keep a one-page “minimum repro card” per MCP server: one read query, one negative test that must be denied, and three expected log tokens—on-call can compare cards to spot config regressions without rereading giant prompts. Note allowed egress and data classification on the card so incidents never widen tokens without a record.
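A repro card is easier to compare when it is data rather than prose. The following is a minimal sketch of such a card; the class name `ReproCard`, its field names, and the helper functions are hypothetical and only mirror the items listed above (one read query, one denied test, three log tokens, egress, classification).

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReproCard:
    server: str
    read_query: str              # the one read query that must succeed
    denied_query: str            # the negative test that must be denied
    expected_log_tokens: tuple   # three tokens expected in the logs
    allowed_egress: tuple = ()
    data_classification: str = "unclassified"


def card_problems(card):
    """Return a list of reasons the card is incomplete."""
    problems = []
    if len(card.expected_log_tokens) != 3:
        problems.append("expected exactly three log tokens")
    if not card.allowed_egress:
        problems.append("allowed egress is not recorded")
    return problems


def diff_cards(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    before, after = asdict(old), asdict(new)
    return {key: (before[key], after[key])
            for key in before if before[key] != after[key]}
```

Diffing the card stored with the last change ticket against the current one shows widened egress or a changed denied test at a glance.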
Field names vary by OpenClaw version; this table fixes the order in which to collect evidence and act.
| Symptom | Collect first | Likely root | Auditable action |
|---|---|---|---|
| Empty/partial tool list | Gateway logs, child exit codes | Missing binary, cwd, permission denied | Use absolute command/args/cwd; run the child as the same user as Gateway |
| First call slow, then OK | Cold-start timing, package fetch logs | npx -y or runtime JIT | Prewarm jobs; pin versions in images; relax first-call timeout |
| Steady timeouts | Child alive, CPU, FD usage | Deadlock, blocking IO | Sample/trace where allowed; A/B with a read-only tool |
| “Tool not registered” | Schema logs, protocol version | Implementation mismatch | Align MCP versions; pin minors; read upstream changelog |
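The second and third rows of the table differ only in how latency behaves over time, which a short script can decide from a list of call durations. This is a rough sketch, not a tuned detector: the function name, the labels, and the `cold_factor` threshold of 3x the median are assumptions chosen for illustration.

```python
from statistics import median


def classify_call_pattern(latencies_s, timeout_s, cold_factor=3.0):
    """Classify ordered call durations (seconds); None marks a timed-out call."""
    if len(latencies_s) < 2:
        return "insufficient_data"

    def timed_out(value):
        return value is None or value >= timeout_s

    first, rest = latencies_s[0], latencies_s[1:]
    rest_timeouts = sum(1 for value in rest if timed_out(value))
    if rest_timeouts * 2 >= len(rest):
        return "steady_timeouts"   # at least half of the later calls fail
    if rest_timeouts > 0:
        return "intermittent"
    baseline = median(rest)
    if timed_out(first) or first > cold_factor * baseline:
        return "cold_start"        # only the first call is slow
    return "healthy"
```

A `cold_start` result suggests prewarming or pinned versions; `steady_timeouts` suggests checking whether the child is alive and what it is blocked on.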
Publish a capability matrix so one workflow is not described three different ways.
| Source | Best for | Versioning | Risk |
|---|---|---|---|
| ClawHub / marketplace | Rapid experiments | Pin commit or semver range; weekly diff | Upstream drift—needs regression tests |
| Repo SKILL.md / private packs | Compliance-heavy flows | Ship with mainline via PR | Maintenance load; align with MCP scope |
| MCP (system of record) | DBs, tickets, internal HTTP APIs | Independent release cadence | Over-broad tokens—maintain allowlists |
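Whether one workflow is described in more than one place can be checked mechanically once the matrix is written down as pairs. A minimal sketch follows; the function name and the source labels (`clawhub`, `repo_skill`, `mcp`) are hypothetical stand-ins for the three rows of the table.

```python
from collections import defaultdict


def duplicated_workflows(declarations):
    """Given (workflow, source) pairs, return workflows declared in >1 source."""
    sources = defaultdict(set)
    for workflow, source in declarations:
        sources[workflow].add(source)
    return {workflow: sorted(found)
            for workflow, found in sources.items() if len(found) > 1}
```

Any workflow in the result is described by more than one source and is a candidate for consolidation.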
# Structural sketch only—real keys, nesting, and hot reload follow current OpenClaw docs.
# Goal: Gateway launches an MCP server over stdio as a fixed user.
#
# mcpServers:
#   internal-readonly-lookup:
#     command: /usr/local/bin/node
#     args: ["/opt/mcp-servers/lookup/dist/index.js"]
#     env:
#       LOOKUP_API_TOKEN: "${LOOKUP_TOKEN_READONLY}"
#
# ClawHub Skill: extract/clone into the team skills directory, then refresh the
# skill index or run the documented reload command for your version.
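An entry shaped like the sketch above can be linted before Gateway starts. The check below is a sketch under assumptions: it takes one server entry as a plain dict with the `command`, `cwd`, and `env` keys from the sketch, and it treats `${NAME}` as an environment reference. Whether your OpenClaw version uses these keys or expands `${NAME}` at all must be confirmed against the current docs; `check_server_entry` is a hypothetical helper, not an OpenClaw API.

```python
import os
import re

PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def check_server_entry(name, entry, environ=None):
    """Return a list of problems found in one MCP server entry."""
    environ = os.environ if environ is None else environ
    problems = []

    command = entry.get("command", "")
    if not os.path.isabs(command):
        problems.append(f"{name}: command is not an absolute path: {command!r}")
    elif not (os.path.isfile(command) and os.access(command, os.X_OK)):
        problems.append(f"{name}: command is missing or not executable: {command}")

    cwd = entry.get("cwd")
    if cwd is not None and not os.path.isabs(cwd):
        problems.append(f"{name}: cwd is not an absolute path: {cwd!r}")

    # Every ${NAME} placeholder must resolve to a non-empty value.
    for key, value in entry.get("env", {}).items():
        for variable in PLACEHOLDER.findall(str(value)):
            if not environ.get(variable):
                problems.append(f"{name}: env {key} references unset {variable}")
    return problems
```

Run the check as the Gateway service user so that the executable test and the environment lookup reflect what the child process will actually see.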
Warning: MCP connects assistants to production data. Least privilege and audit trails beat “just make it work.” Split read vs write servers, split tokens, and attach allowlist snippets to the change ticket.
... PATH, cwd, and bind mounts.
... memory_search or doc tools to curb context growth.
... AGENTS.md: anything >1 needs a signed waiver.

On remote Macs or cloud hosts, disk and log rotation affect MCP children that spill temp files to small system volumes, so timeouts may look random even though the model config is unchanged. Review host ops alongside tool config.
For MCP servers fronted over HTTP/SSE, include reverse-proxy idle timeouts, Upgrade handling, and TLS termination in the review: Gateway may log a successful handshake while the edge proxy returns 499/504. Cross-check the Nginx/Caddy reverse-proxy guide before simply raising OpenClaw timeouts.
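The proxy side of this can be reasoned about with simple arithmetic. The sketch below assumes the proxy closes an upstream connection once no bytes arrive for its idle window, which is how Nginx's `proxy_read_timeout` (60 seconds by default) behaves; the function and parameter names are illustrative, and the numbers must come from your own proxy config and tool timings.

```python
def proxy_timeout_risks(proxy_idle_s, slowest_tool_s, heartbeat_s=None):
    """Return reasons a long-lived MCP stream could be cut by the proxy."""
    risks = []
    if heartbeat_s is None:
        # With no keepalive traffic, a tool call is silent until it finishes.
        if slowest_tool_s >= proxy_idle_s:
            risks.append(
                f"slowest tool ({slowest_tool_s}s) outlasts the proxy idle "
                f"timeout ({proxy_idle_s}s) and no heartbeat is sent"
            )
    elif heartbeat_s >= proxy_idle_s:
        risks.append(
            f"heartbeat interval ({heartbeat_s}s) is not shorter than the "
            f"proxy idle timeout ({proxy_idle_s}s)"
        )
    return risks
```

An empty result only means the idle window is not the obvious culprit; Upgrade handling and TLS termination still need their own checks.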
Directional community note (not a benchmark): running three heavy MCP servers alongside wide retrieval often produces minute-scale queue jitter. For SLA purposes, capability matrices and allowlists do more than adding plugins without limit.
Sleep, VPN flaps, and path drift make child processes and skill indexes unpredictable. Connecting real business data demands 24/7 uptime, persistent paths, and auditable permissions.
Self-managed boxes without multi-region choice or flexible terms encourage shared hosts where cold starts and log IO contend. Placing Gateway on dedicated Apple Silicon with predictable disks and egress—typical of a professional Mac cloud—usually makes MCP and Skills policies enforceable in contracts. MACCOME offers multi-region Mac Mini M4 / M4 Pro with flexible rental terms as a stable base for Gateway and build farms; confirm public rates and help-center SLAs before ordering.
Pilot the three checks from this runbook on a remote Mac before promoting one image fleet-wide—avoid “works locally, times out in prod” loops. If Gateway is internet-facing, ship TLS, rate limits, and IP allowlists in the same change, not as a later patch.
FAQ
How does this pair with channel onboarding?
Channel guides cover Slack/Discord/Telegram OAuth; this article covers tool discovery. If messages reach Gateway but tools fail, gather evidence from Table 1 before revisiting channel “connected but silent” cases.
What should rollback include?
Remove MCP entries, document restart order, run a read-only verification query, and confirm tool counts return to baseline on dashboards. Align billing using rental rates.
Container vs bare-metal paths differ—now what?
Maintain an absolute-path matrix per runtime; never let the model guess paths in chat. Cross-check the help center with the Docker volumes article.