WhatsApp user-proxy frontend via Baileys bridge (linked-device, nanobot-style) #436

@rockfordlhotka

Summary

Add a WhatsApp IUserFrontend implementation so RockBot can act as a conversational agent over WhatsApp for a single authorized user, using the WhatsApp Web (linked-device) protocol via Baileys — the same approach nanobot uses.

Sibling to #295 (Discord user-proxy). Replaces the previously-filed #435 (which sketched Meta's official Cloud API — wrong approach for this use case).

Motivation

A colleague asked for WhatsApp support that behaves "like nanobot or OpenClaw do today." Investigating nanobot's implementation revealed they use a Node.js bridge running @whiskeysockets/baileys, with QR-code linked-device auth against the user's personal WhatsApp number. No Meta business account, no message templates, no 24-hour reply window — and groups, media, and voice "just work."

The Cloud API path (#435) was the wrong shape for this ask. It requires a separate business phone number, business verification, and pre-approved templates for any agent-initiated message sent more than 24 hours after the user's last message. Useful for a different use case, not this one.

Trade-offs (explicit, eyes-open)

Pros

  • Free; no Meta Business setup, no verification, no per-message billing
  • Personal phone number is the bot's identity — the user sees WhatsApp messages from "themselves" or whatever number they linked
  • Full bidirectional chat including DMs, groups, media, voice (with transcription)
  • No 24-hour window / message-template constraint
  • Architecturally clean: bridge subprocess isolates the protocol library from the agent core

Cons (these are real — accept them or don't build this)

  • Against WhatsApp's Terms of Service. Accounts using Baileys have been banned. Risk is unpredictable but non-zero
  • Linked-device sessions can be revoked by WhatsApp at any time, forcing re-scan
  • Baileys is reverse-engineered — protocol changes upstream can break the bridge until Baileys catches up
  • Adds a Node.js >=20 runtime dependency to the userproxy container
  • Same out-of-scope rationale that excluded Signal from #295 ("signal-cli is unofficial and fragile") arguably applies here. Decide whether the value justifies the relaxation

Design

Two-process architecture mirroring nanobot:

bridge/ — Node.js subprocess (new)

  • Standalone TypeScript project, builds to a static artifact bundled with the userproxy image
  • Owns @whiskeysockets/baileys, the WhatsApp Web protocol, and the linked-device session state on disk
  • Exposes a local-only WebSocket server (loopback bind, shared-secret auth token)
  • Simple JSON protocol — both directions:
    • Outbound {type:"send", to, text}, {type:"send_media", to, filePath, mimetype, fileName}
    • Inbound {type:"message", pn, sender, content, id, isGroup, wasMentioned, media[], timestamp}
    • Lifecycle: {type:"auth", token}, {type:"status", status}, {type:"qr", ...}, {type:"error", ...}
  • On first run: prints QR to stdout for the user to scan with WhatsApp → Settings → Linked Devices
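
The frame shapes listed above could be written down as TypeScript types so both sides of the WebSocket agree on them. This is a sketch under stated assumptions, not the bridge's actual code: the issue names the fields but not their types, so every field type here is a guess, and the helper names parseInboundFrame and encodeFrame are hypothetical. Returning null for malformed frames is likewise an assumed policy.

```typescript
// Frames the .NET frontend sends to the bridge (shapes taken from the list above).
type OutboundFrame =
  | { type: "auth"; token: string }
  | { type: "send"; to: string; text: string }
  | { type: "send_media"; to: string; filePath: string; mimetype: string; fileName: string };

// Frames the bridge sends to the frontend. Field TYPES are assumptions;
// the issue only lists the field names.
interface MessageFrame {
  type: "message";
  pn: string;
  sender: string;
  content: string;
  id: string;
  isGroup: boolean;
  wasMentioned: boolean;
  media: string[];
  timestamp: number;
}

type InboundFrame =
  | MessageFrame
  | { type: "status"; status: string }
  | { type: "qr"; [extra: string]: unknown }
  | { type: "error"; [extra: string]: unknown };

// Hypothetical helper: serialize one outbound frame for the socket.
function encodeFrame(frame: OutboundFrame): string {
  return JSON.stringify(frame);
}

// Hypothetical helper: parse one text frame, returning null for anything
// that is not valid JSON or not a recognized frame type.
function parseInboundFrame(raw: string): InboundFrame | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) return null;
  const f = value as Record<string, unknown>;
  const kind = f["type"];

  if (kind === "status") {
    const status = f["status"];
    return typeof status === "string" ? { type: "status", status } : null;
  }
  if (kind === "qr") return { ...f, type: "qr" };
  if (kind === "error") return { ...f, type: "error" };
  if (kind !== "message") return null;

  const sender = f["sender"];
  const content = f["content"];
  const id = f["id"];
  if (typeof sender !== "string" || typeof content !== "string" || typeof id !== "string") {
    return null;
  }
  const pn = f["pn"];
  const media = f["media"];
  const timestamp = f["timestamp"];
  return {
    type: "message",
    sender,
    content,
    id,
    pn: typeof pn === "string" ? pn : "",
    isGroup: f["isGroup"] === true,
    wasMentioned: f["wasMentioned"] === true,
    media: Array.isArray(media) ? media.filter((m): m is string => typeof m === "string") : [],
    timestamp: typeof timestamp === "number" ? timestamp : 0,
  };
}
```

Keeping the union discriminated on type means a switch over frame.type is checked by the compiler, which helps if new frame kinds are added later.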

src/RockBot.UserProxy.WhatsApp/ — .NET frontend (new)

Implements IUserFrontend:

  • Connects to the local bridge over WebSocket on startup, sends {type:"auth", token: <shared secret>}
  • Outbound: DisplayReplyAsync → JSON send frame to the bridge
  • Inbound: listens for message frames, applies the sender allowlist, publishes UserMessage to user.message on the bus — same path CLI/Blazor frontends use today
  • Handles both @s.whatsapp.net (legacy phone) and @lid (new LID) identity formats, with LID→phone mapping like nanobot does
  • 1000-entry LRU for message-ID dedup
  • Auto-reconnect with backoff if the bridge connection drops
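
The dedup and reconnect bullets above could look roughly like the following. It is written in TypeScript for consistency with the bridge, although the real frontend would be .NET, so treat it as a description of the logic rather than code to ship. The names SeenMessageIds and reconnectDelayMs are hypothetical, and the backoff base (1 s) and cap (30 s) are assumed values that the issue does not specify.

```typescript
// Bounded "have I seen this ID?" set with least-recently-used eviction.
// Relies on Map preserving insertion order, so the first key is the oldest.
class SeenMessageIds {
  private readonly ids = new Map<string, true>();
  private readonly capacity: number;

  constructor(capacity = 1000) {
    this.capacity = capacity;
  }

  // Returns true the first time an ID is seen, false for a repeat.
  markIfNew(id: string): boolean {
    if (this.ids.has(id)) {
      // Re-insert to mark the ID as recently used.
      this.ids.delete(id);
      this.ids.set(id, true);
      return false;
    }
    this.ids.set(id, true);
    if (this.ids.size > this.capacity) {
      const oldest = this.ids.keys().next().value;
      if (oldest !== undefined) this.ids.delete(oldest);
    }
    return true;
  }
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
// attempt is zero-based. Base and cap are assumptions, not from the issue.
function reconnectDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** Math.max(0, attempt));
}
```

A frame would be processed only when markIfNew(frame.id) returns true. Adding random jitter to the delay is a common refinement if several clients could reconnect at once, though with a single sidecar it probably matters little.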

Configuration (WhatsAppOptions)

  • BridgeUrl (default ws://localhost:3001) — bridge WebSocket endpoint
  • BridgeToken — shared secret for bridge auth (generated on first run if blank, persisted to a config file mounted from a Kubernetes Secret)
  • AuthorizedPhoneNumbers — E.164 allowlist (e.g. ["+1234567890"]); messages from anything else dropped pre-bus
  • GroupPolicy: Open (respond to all group messages) or Mention (respond only when @mentioned). Direct messages always get a reply if the sender is allowed
  • MediaDownloadPath — where the bridge writes inbound media (PVC mount)
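
A startup-time sanity check over these options might look like the sketch below. It mirrors the option names from the list above in a TypeScript interface purely for illustration. The validateOptions helper and its specific rules are assumptions: rejecting an empty allowlist, requiring a loopback host, and requiring a media path are choices made for this sketch. BridgeToken is deliberately not validated, since the issue says a blank token is generated on first run.

```typescript
type GroupPolicy = "Open" | "Mention";

// Property names follow the WhatsAppOptions list above.
interface WhatsAppOptions {
  BridgeUrl: string;
  BridgeToken: string;
  AuthorizedPhoneNumbers: string[];
  GroupPolicy: GroupPolicy;
  MediaDownloadPath: string;
}

const DEFAULT_BRIDGE_URL = "ws://localhost:3001";

// E.164: a plus sign, a non-zero leading digit, at most 15 digits in total.
const E164 = /^\+[1-9]\d{1,14}$/;

// Assumed rule: the bridge is loopback-only, so reject any other host.
const LOOPBACK_WS = /^ws:\/\/(localhost|127\.0\.0\.1|\[::1\])(:\d+)?(\/.*)?$/;

// Hypothetical helper: returns a list of human-readable problems,
// empty when the options look usable.
function validateOptions(o: WhatsAppOptions): string[] {
  const problems: string[] = [];
  if (!LOOPBACK_WS.test(o.BridgeUrl)) {
    problems.push(`BridgeUrl is not a loopback ws:// URL: ${o.BridgeUrl}`);
  }
  if (o.AuthorizedPhoneNumbers.length === 0) {
    problems.push("AuthorizedPhoneNumbers is empty, so every message would be dropped");
  }
  for (const n of o.AuthorizedPhoneNumbers) {
    if (!E164.test(n)) problems.push(`Not an E.164 number: ${n}`);
  }
  if (o.MediaDownloadPath.trim() === "") {
    problems.push("MediaDownloadPath is required");
  }
  return problems;
}
```

Failing fast on a malformed allowlist entry seems preferable to silently dropping every message from the intended user.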

Least-privilege / "nothing trusts the LLM" alignment

  • Bridge subprocess is the only component with the Baileys dependency and the linked-device session — minimal Baileys blast radius
  • Bridge binds loopback-only and requires a token; .NET frontend speaks only JSON to it
  • Sender filtering happens in the .NET frontend before bus publish — unauthorized numbers never produce a UserMessage
  • Outbound bus events carry source: whatsapp + principal: <phone> (mirrors #295)
  • If the bridge crashes or gets revoked by WhatsApp, blast radius is one process and one linked-device session
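
The "filter before publish" rule above can be expressed as a single function that either returns a tagged event or returns nothing. The sketch is in TypeScript for illustration only, since the real frontend is .NET. The IncomingMessage and UserMessageEvent shapes, the toUserMessage and phoneFromJid names, and the interpretation of pn as the phone-style JID accompanying a LID sender are all assumptions, not confirmed details of nanobot or RockBot.

```typescript
// Subset of the bridge's message frame that the filter needs (assumed types).
interface IncomingMessage {
  pn: string;
  sender: string;
  content: string;
  isGroup: boolean;
  wasMentioned: boolean;
}

// Hypothetical bus payload carrying the source/principal tags.
interface UserMessageEvent {
  content: string;
  source: "whatsapp";
  principal: string; // E.164 phone number
}

const PHONE_JID_SUFFIX = "@s.whatsapp.net";

// "15551234567@s.whatsapp.net" becomes "+15551234567"; anything else is null.
function phoneFromJid(jid: string): string | null {
  if (!jid.endsWith(PHONE_JID_SUFFIX)) return null;
  // Drop an optional ":device" suffix on the user part.
  const user = jid.slice(0, -PHONE_JID_SUFFIX.length).split(":")[0] ?? "";
  return /^[1-9]\d{1,14}$/.test(user) ? `+${user}` : null;
}

// Returns the event to publish, or null when the message must be dropped.
function toUserMessage(
  msg: IncomingMessage,
  allowlist: ReadonlySet<string>,
  groupPolicy: "Open" | "Mention",
  lidToPhone: Map<string, string>,
): UserMessageEvent | null {
  const fromPn = phoneFromJid(msg.pn);
  const fromSender = phoneFromJid(msg.sender);
  // Assumption: when a non-phone sender arrives alongside a phone JID,
  // remember the pairing so later frames without pn still resolve.
  if (fromPn !== null && fromSender === null) lidToPhone.set(msg.sender, fromPn);

  const phone = fromPn ?? fromSender ?? lidToPhone.get(msg.sender) ?? null;
  if (phone === null || !allowlist.has(phone)) return null;
  if (msg.isGroup && groupPolicy === "Mention" && !msg.wasMentioned) return null;

  return { content: msg.content, source: "whatsapp", principal: phone };
}
```

Because the function has no side effect other than updating the LID cache, the "unauthorized numbers never produce a UserMessage" property can be unit-tested without a bus or a socket.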

Deployment

  • New Helm subchart in deploy/helm/rockbot/, runs as its own pod — single pod, two containers:
    • bridge — Node.js container running the WhatsApp bridge
    • frontend — .NET container running RockBot.UserProxy.WhatsApp
    • Bridge and frontend communicate over loopback inside the pod
  • New Dockerfile per container: deploy/Dockerfile.whatsapp-bridge, deploy/Dockerfile.userproxy-whatsapp
  • PVC required for bridge session state (/data/whatsapp-auth) so a pod restart doesn't force re-scanning the QR. Same PVC also hosts inbound media
  • No public ingress needed — bridge dials WhatsApp servers outbound; no webhook
  • First-run QR flow:
    • Bridge container starts and prints the QR to its stdout (kubectl logs)
    • User opens WhatsApp → Settings → Linked Devices → scans QR
    • Session persists to PVC; subsequent restarts skip the QR step

Out of scope (future issues)

  • Multi-user — single authorized phone number only; multi-tenant needs per-principal isolation
  • Message templates / Cloud API: see closed #435 if proactive >24h alerts ever become wanted
  • Rich features — read receipts, typing indicators, reactions, interactive list/button messages
  • Outbound voice / audio synthesis — text and media only

Acceptance criteria

  • New Node.js bridge/ project using @whiskeysockets/baileys, building to a deployable artifact
  • Local-only WebSocket server in the bridge with shared-token auth
  • New RockBot.UserProxy.WhatsApp project implementing IUserFrontend
  • Outbound: agent replies sent via bridge send / send_media frames
  • Inbound: bridge message frames forwarded to user.message after allowlist filter
  • Messages from unauthorized phone numbers are dropped (never reach the bus)
  • Bus events tagged with source: whatsapp + principal: <phone> metadata
  • LID and legacy phone JID formats both handled; LID→phone mapping cached
  • Message-ID dedup so re-delivered frames don't double-process
  • Voice messages transcribed before reaching the agent (via existing transcription path if one exists, otherwise document the gap)
  • Auto-reconnect to the bridge with backoff
  • Unit tests for the .NET frontend (mock WebSocket / bridge)
  • Dockerfile + Helm wiring including PVC for whatsapp-auth/
  • README documenting linked-device setup (QR scan), ToS risk acknowledgement, secret configuration, group policy

Open questions

  • Bridge implementation language — Node.js (matches nanobot, Baileys is the reference implementation) vs. trying to port to a .NET WhatsApp Web client. Node is the pragmatic answer; the bridge is small and contained
  • Voice transcription — does RockBot already have a transcription path the bridge could call out to, or do we add one? Nanobot transcribes server-side before invoking the agent
  • Single pod vs. two pods — sidecar pattern (one pod, two containers) keeps the loopback boundary trivial; two pods would expose the bridge token over the pod network. Lean sidecar
  • Session loss handling — when WhatsApp revokes the linked device, the bridge needs to surface that loudly. Should this fail the pod's readiness probe so the user notices, or silently log and serve an error via DisplayErrorAsync? Probably readiness-fail
  • ToS acceptance — should this require an explicit consentAcknowledged: true config flag (like nanobot's email channel) before the bridge will start, forcing the deployer to confirm they understand the ban risk?
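
If the last two questions were answered with "readiness-fail" and "require the flag", the bridge-side gates could be as small as the sketch below. Everything here is hypothetical: the BridgeConfig shape, the canStartBridge and ReadinessGate names, and in particular the status value "connected", since the issue does not enumerate what the status frame can carry.

```typescript
// Hypothetical bridge config; only consentAcknowledged comes from the issue.
interface BridgeConfig {
  consentAcknowledged?: boolean;
}

// Strict check: only the boolean true counts, so a missing value or the
// string "true" does not start the bridge by accident.
function canStartBridge(config: BridgeConfig): boolean {
  return config.consentAcknowledged === true;
}

// Tracks the most recent status so a readiness endpoint can report it.
// ASSUMPTION: "connected" is the status meaning the linked-device session
// is live; any other status (for example after a revocation) reads as not ready.
class ReadinessGate {
  private ready = false;

  onStatus(status: string): void {
    this.ready = status === "connected";
  }

  isReady(): boolean {
    return this.ready;
  }
}
```

A readiness probe backed by isReady() would take the pod out of the ready state after a revocation, which is noisier than a log line and therefore harder to miss.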
