Cluster Federation

Single-center asymmetric node federation — one brain orchestrates many machines while the cluster keeps exactly one mind. Reverse RPC, node_invoke, command allowlists, and approval routing.

"One core, many shells" gives one brain many I/O channels. Cluster Federation extends the same principle to one brain, many execution bodies: a single center aleph-server orchestrates work across several physical machines (nodes) — while preserving the hard rule that the cluster has exactly one mind.

This is the "one core, many bodies" extension of Single-Core Multi-Terminal. The center is still the only reasoning core; nodes are pure execution arms.

For how a single shell attaches to a single core (local or remote), see Desktop Bridge. This page is about the orthogonal axis: how the center commands many nodes.


Topology: Single-Center, Asymmetric

Aleph deliberately rejects symmetric mesh, multi-master, and distributed consensus. The cluster is one center + N nodes:

        [Human / Aleph Channel]              ← attaches to the single front door (center),
              │ JSON-RPC over WS                local or remote, mutually exclusive

   ╔══════════════════════════╗
   ║   Center Aleph Core       ║   The LLM sees exactly TWO cluster tools:
   ║   (the only brain)        ║     environments.list      (read)
   ║   Think → Act loop        ║     node_invoke(node, cmd, params)  (write)
   ║                           ║
   ║  ┌────────────────────┐   ║   NodeRegistry: nodes_by_id / nodes_by_conn
   ║  │ src/cluster/        │   ║   pending_invokes: reverse-RPC id correlation
   ║  │  NodeRegistry        │   ║
   ║  │  env aggregation +   │   ║
   ║  │  routing             │   ║
   ║  └─────────┬──────────┘   ║
   ╚════════════│═════════════╝
   center→node  │ node.invoke         node→center ▲ events
   (reverse RPC)│ (over the node's    (push back)  │
                │  always-open WS)
        ┌───────┴───────┬───────────────────┐
        ▼               ▼                   ▼
   ┌─────────┐     ┌─────────┐        ┌──────────┐
   │ Node B   │    │ Node C   │        │  local   │
   │NodeClient│    │NodeClient│        │ (the host│
   │ dials →  │    │ dials →  │        │  is also │
   │  center  │    │  center  │        │  an env) │
   │ declares │    │ declares │        └──────────┘
   │ commands │    │ commands │
   │ executes │    │ executes │
   │ locally  │    │ locally  │
   └─────────┘     └─────────┘
   node = pure execution arm: receives node.invoke → runs a local
   tool/agent → returns result + events

| Dimension | Decision |
| --- | --- |
| Topology | Single-center asymmetric: 1 center (the only brain, holds the NodeRegistry + orchestration) + N nodes (pure execution arms that dial in). |
| Front door | The center is the only entry point. Connecting to a node ≠ using the cluster: a node cannot see or orchestrate the cluster. Operating the cluster remotely means remotely connecting a channel to the center. |
| Dial direction | Nodes dial out to the center (NAT-friendly), not the other way around. |
| Identity | Identity = (the core you connect to) × (the tier that core grants your device). Reuses the existing pairing + operator/guest + chat/config model. No new identity system. |
| Memory | Center memory = cluster memory. No distributed shared memory. A node's sub-agent uses the node's local memory; results flow back to the center. |
| Dual role | A machine is exactly one of: standalone / center / node. No machine is simultaneously a center and someone else's node. |

Not a high-availability cluster. The single center is a single point of failure, which is acceptable for a personal assistant spanning a few machines. There is no failover, no multi-center, no consensus.


The LLM sees two tools, not N

The center LLM never sees a growing tool surface as nodes join. It always sees the same two cluster primitives:

  • environments.list — self-describing. Each online node reports its id, status, and command catalog (names + JSON Schema). The host itself is an environment with id: "local".
  • node_invoke(node_id, command, params) — the universal execution entry point.

The model reads environments.list and assembles the invoke itself. Which machine, which command, which params — all of it is reasoning that lives in the prompt, not in deterministic routing code. This is LLM Sovereignty applied to placement: "on which machine should this run?" is just one more parameter the model fills in.
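
To make the "catalog as prompt data" idea concrete, here is a minimal sketch of what an environments.list entry might look like and how it could be flattened into text for the model. The struct fields, the `render_environments` helper, and the line format are assumptions for illustration, not Aleph's actual types; JSON Schemas are kept as raw text to stay dependency-free.

```rust
/// One entry of an `environments.list` result (field names are assumptions).
pub struct Environment {
    pub id: String,     // "local" for the host, e.g. "node:B" for a node
    pub status: String, // e.g. "online"
    /// (command name, JSON Schema for its params, kept as raw text here)
    pub commands: Vec<(String, String)>,
}

/// Renders the catalog as plain data for the prompt. There is no placement
/// logic here: choosing the node and the command is left to the model.
pub fn render_environments(envs: &[Environment]) -> String {
    let mut out = String::new();
    for env in envs {
        out.push_str(&format!("{} [{}]\n", env.id, env.status));
        for (name, schema) in &env.commands {
            out.push_str(&format!("  {} {}\n", name, schema));
        }
    }
    out
}
```

The point of the sketch is what is missing: no scoring, no routing table, no intent classifier. The function only serializes what the nodes declared.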

Four capability classes collapse into one transport + one command catalog:

| Capability | Expressed as |
| --- | --- |
| Tool execution | node_invoke("node:B", "bash", { ... }) |
| Capability access | node_invoke("node:B", "desktop.screenshot", { ... }) |
| Sub-agent delegation | node_invoke("node:B", "agent.run", { task }); B runs Think→Act locally and streams progress back |
| Event reporting | node → center event channel (reverse); the center subscribes and the LLM reacts proactively |

A dedicated node_file(node, direction, local_path, remote_path) tool moves binaries and large files between center and node (push/pull, SHA-256 verified on both ends). Bytes flow process-to-process and never enter the LLM context — only paths go in, and a { direction, bytes, sha256, paths } summary comes back.


Node lifecycle

The whole lifecycle reuses the existing pairing / token / tool machinery — there is no parallel enrollment system.

  1. Enroll. On node B, the operator says (in natural language) "join this machine to <center URL>." B's aleph-server node arm dials out to the center and runs the interactive pairing flow: the center issues a 6-digit code, B prints it to stdout, and the center operator approves it from the Panel (the same approval surface used for cold-browser pairing). On approval the center mints a DeviceRole::Node token, which B persists at ~/.aleph/node/<name>.json (mode 0600).
  2. Declare. Once connected, B reports the command catalog it is willing to expose (names + JSON Schema), sourced from B's local tools/skills. Default deny, explicit allowlist — the center can only call what B approved.
  3. Invoke. The center LLM calls node_invoke("node:B", "bash", { ... }) → the node_invoke tool → NodeRegistry finds B's session → the call is delivered over the always-open WS as a reverse-RPC tool.call frame → B's NodeClient dispatches it to a local tool → the result returns by id → the pending table wakes → control returns to the LLM. Long tasks stream back over the event channel.
  4. Perceive. B's daemon events and sub-agent progress push back over the same WS. The center routes them by topic to subscribers — Panel rendering, and the center LLM reacting proactively.
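
The Declare and Invoke steps hinge on default deny at the node. A minimal sketch of that idea, assuming a hypothetical `Allowlist` type with function-pointer handlers (none of these names come from the Aleph codebase, and params are plain JSON text to avoid dependencies):

```rust
use std::collections::HashMap;

/// A local tool handler: takes params as JSON text, returns a result or an error.
pub type Handler = fn(&str) -> Result<String, String>;

/// Commands the node has explicitly agreed to expose. Anything absent is denied.
pub struct Allowlist {
    handlers: HashMap<String, Handler>,
}

impl Allowlist {
    pub fn new() -> Self {
        Allowlist { handlers: HashMap::new() }
    }

    /// Explicitly expose one command.
    pub fn expose(&mut self, name: &str, handler: Handler) {
        self.handlers.insert(name.to_string(), handler);
    }

    /// Names to report in the Declare step (sorted for stable output).
    pub fn declared(&self) -> Vec<String> {
        let mut names: Vec<String> = self.handlers.keys().cloned().collect();
        names.sort();
        names
    }

    /// Default deny: an undeclared command never reaches a handler.
    pub fn dispatch(&self, command: &str, params: &str) -> Result<String, String> {
        match self.handlers.get(command) {
            Some(handler) => handler(params),
            None => Err(format!("command not declared: {}", command)),
        }
    }
}

/// Example handler used only for illustration: returns its params unchanged.
pub fn echo(params: &str) -> Result<String, String> {
    Ok(params.to_string())
}
```

Because the declared catalog and the dispatch table are the same map in this sketch, a command cannot be callable without also being visible to the center, and vice versa.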

Reverse RPC: the core new mechanism

A plain Gateway connection is client→server request/response plus server→client one-way notifications (the event bus). The cluster needs the missing piece, server→client request/response: the center can issue an id-correlated request down to a connected node and await its response.

center                                   node B (NodeClient)
  │  node_invoke("node:B","bash",{…})        │
  │                                          │
  ├─ pending.insert((conn,id), tx) ──────────┤
  │  tool.call { method, params, id } ──────▶ │  dispatch to local tool
  │                                          │  (spawned, never blocks
  │                                          │   the read loop)
  │  ◀────────── { result, id } ─────────────┤
  ├─ pending.remove((conn,id)) → tx.send ────┤
  │  result returns to the LLM               │

  • Pending table: DashMap<(conn_id, req_id), oneshot::Sender<…>> correlates each outgoing request with the node's eventual reply.
  • No head-of-line blocking on the node: the node dispatches each tool.call on its own task, so its read loop is always free to receive the next frame (and the approval responses described below).
  • Liveness: when a node disconnects, every in-flight call against it is cancelled immediately — the pending senders are drained so callers get a fast Cancelled error instead of waiting out the timeout. The center emits node.connected / node.disconnected events (mirroring presence join/leave) so subscribers and the LLM learn about topology changes without polling.
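
The correlation and liveness behavior described above can be sketched with the standard library alone. This is a stand-in under stated assumptions: a `Mutex<HashMap>` replaces DashMap, `std::sync::mpsc` replaces the oneshot channel, and the `Pending`, `Reply`, `register`, `resolve`, and `cancel_conn` names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

pub type ConnId = u64;
pub type ReqId = u64;

/// What a waiting caller eventually receives.
#[derive(Debug, PartialEq)]
pub enum Reply {
    Result(String),
    Cancelled,
}

/// Std-only stand-in for the DashMap + oneshot pending table.
pub struct Pending {
    next_id: Mutex<ReqId>,
    waiting: Mutex<HashMap<(ConnId, ReqId), Sender<Reply>>>,
}

impl Pending {
    pub fn new() -> Self {
        Pending { next_id: Mutex::new(0), waiting: Mutex::new(HashMap::new()) }
    }

    /// Registers an outgoing request; the caller blocks on the returned receiver.
    pub fn register(&self, conn: ConnId) -> (ReqId, Receiver<Reply>) {
        let mut next = self.next_id.lock().unwrap();
        *next += 1;
        let id = *next;
        let (tx, rx) = channel();
        self.waiting.lock().unwrap().insert((conn, id), tx);
        (id, rx)
    }

    /// Called when the node's `{ result, id }` frame arrives.
    /// Returns false for unknown or already-resolved ids.
    pub fn resolve(&self, conn: ConnId, id: ReqId, result: String) -> bool {
        match self.waiting.lock().unwrap().remove(&(conn, id)) {
            Some(tx) => tx.send(Reply::Result(result)).is_ok(),
            None => false,
        }
    }

    /// Called on disconnect: wakes every in-flight caller on that connection
    /// with `Cancelled` and returns how many calls were cancelled.
    pub fn cancel_conn(&self, conn: ConnId) -> usize {
        let mut waiting = self.waiting.lock().unwrap();
        let keys: Vec<(ConnId, ReqId)> =
            waiting.keys().filter(|k| k.0 == conn).copied().collect();
        for key in &keys {
            if let Some(tx) = waiting.remove(key) {
                let _ = tx.send(Reply::Cancelled);
            }
        }
        keys.len()
    }
}
```

Keying on (conn_id, req_id) rather than req_id alone means a reply arriving on the wrong connection cannot resolve another node's request, and draining by connection on disconnect is a simple filter over the keys.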

Approval routing: nodes can ask the center

A node runs headless — it has no operator sitting at it. So when a node's sandbox hits a command that needs a capability upgrade, it doesn't decide locally; it routes the approval request back up to the center, where a human operator decides.

node B sandbox hits a privileged command

  │  node.approval.request  (reverse, over B's WS)

center  →  ExecApprovalManager  →  Panel approval card
              (the SAME card used for local exec approval;
               the node context is encoded into the command
               field: "node '<name>': <tool> — <reason>")

  │  operator approves / denies / approves-for-session

decision returns down to B as the JSON-RPC response → B's
sandbox maps it to an outcome (approved / approved_session /
denied / timeout) — fail-closed on every path

The node's identity on every approval is stamped from the authenticated connection, never from request params, so a node cannot forge another node's identity or approve itself. The request is approval.*-scoped and resolves through the existing exec.approval.resolve method; no new Panel UI or event scope was needed.
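
Two properties of approval routing lend themselves to a short sketch: fail-closed outcome mapping on the node, and identity stamping on the center. The wire strings ("approved", "approved_session"), the `map_decision` and `stamp` functions, and the `ApprovalRequest` struct are assumptions for illustration; only the four outcome names come from the text above.

```rust
/// The four outcomes a node's sandbox can end up with.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Outcome {
    Approved,
    ApprovedSession,
    Denied,
    Timeout,
}

impl Outcome {
    /// Only the two explicit approvals let the command run.
    pub fn allows_execution(self) -> bool {
        matches!(self, Outcome::Approved | Outcome::ApprovedSession)
    }
}

/// Node side: maps the center's reply to an outcome.
/// `None` means no reply arrived before the deadline.
pub fn map_decision(reply: Option<&str>) -> Outcome {
    match reply {
        None => Outcome::Timeout,
        Some("approved") => Outcome::Approved,
        Some("approved_session") => Outcome::ApprovedSession,
        // Fail closed: explicit denials and anything unrecognized both deny.
        Some(_) => Outcome::Denied,
    }
}

/// An approval request as the center records it (hypothetical shape).
pub struct ApprovalRequest {
    pub node: String,
    pub tool: String,
    pub reason: String,
}

/// Center side: the node name is taken from the authenticated connection.
/// Any name claimed inside the request params is ignored.
pub fn stamp(
    authenticated_node: &str,
    _claimed_node: Option<&str>,
    tool: &str,
    reason: &str,
) -> ApprovalRequest {
    ApprovalRequest {
        node: authenticated_node.to_string(),
        tool: tool.to_string(),
        reason: reason.to_string(),
    }
}
```

The catch-all arm is what makes the mapping fail-closed: a malformed or future reply value can never be read as an approval.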


Security model

node_invoke is, by design, a remote code-execution channel — the center can run commands on a node. The boundaries are therefore non-negotiable:

  • The node-side allowlist is the primary security boundary. B exposes only explicitly declared commands; the center can call only approved ones. Default deny. Even a compromised center can do only what B permits.
  • Bidirectional trust. B dialing in + holding a center-issued node token = B trusts that center. The center holding B's allowlist = the center can only drive B within the boundary.
  • Tier gating on the human. Whoever triggers node_invoke is bound by the center's tiering — operators may orchestrate the cluster; chat/guest sessions are read-only (environments.list) and cannot invoke.
  • Sensitive node operations route to approval. Commands marked at the config tier suspend and wait for the center operator (see approval routing above).
  • Credential isolation. The node token is never the host token, mirroring the "the host token never leaks to a remote" discipline of shell-core separation.
  • Transport. Same explicit trade-off as remote shells: plaintext over LAN/Tailscale is acceptable; run over a private network.

For file transfer specifically, both ends verify SHA-256, a single frame is capped at 8 MB, and the node jails every path inside its session workspace (an explicit canonical_root containment check, not just a deny-list).
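
The frame cap and the path jail can be illustrated as follows. This is a lexical approximation under stated assumptions: it treats 8 MB as 8 × 1024 × 1024 bytes, and it does not resolve symlinks, which a real canonical-root check would do by canonicalizing paths on disk. The `jail` and `frame_count` names are hypothetical.

```rust
use std::path::{Component, Path, PathBuf};

/// Assumed interpretation of the 8 MB single-frame cap.
pub const MAX_FRAME_BYTES: usize = 8 * 1024 * 1024;

/// Number of frames needed to send `len` bytes under the cap.
pub fn frame_count(len: usize) -> usize {
    len / MAX_FRAME_BYTES + usize::from(len % MAX_FRAME_BYTES != 0)
}

/// Resolves `requested` against `root` lexically and rejects anything that
/// would land outside `root`. Returns the jailed path on success.
pub fn jail(root: &Path, requested: &Path) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    let mut depth = 0usize; // how many components we are below `root`
    for part in requested.components() {
        match part {
            Component::Normal(name) => {
                resolved.push(name);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                // Climbing above the root is an escape attempt.
                if depth == 0 {
                    return None;
                }
                resolved.pop();
                depth -= 1;
            }
            // Absolute paths and drive prefixes are never accepted.
            Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(resolved)
}
```

A containment check of this kind rejects by construction rather than by pattern: a deny-list has to enumerate bad paths, whereas this accepts only paths that provably stay under the root.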


Architectural fit

The cluster adds capability without touching the agent loop. The harness still only schedules Think→Act — it does not even know a tool is remote.

| Redline | How it holds |
| --- | --- |
| R1: Brain/limb separation | Federation is core↔node, all Rust. A node's platform capabilities still go through its local DesktopCapability trait + bridge; src never touches platform APIs. |
| R3: Core minimalism | Zero new heavy dependencies; reuses the existing WS / JSON-RPC / DashMap stack. |
| R4: Interface is pure I/O | The Panel only renders environments (a thin contract); it does not aggregate, route, or persist. |
| R6: One core, many channels | The single center is the only brain; the cluster is its "one core, many bodies" extension. |
| R7: LLM sovereignty | Which machine, which params, retry on failure: all left to the model. No deterministic intent classification or routing engine. |
| R8: Everything is a tool | Cluster management (enroll / approve / expose / list) and node capabilities are all tools; conversation is the control panel. |
| R9: Intelligence in the prompt | environments is injected as data into the prompt; one inference pass covers the placement decision. |
| R10: Thin harness | src/harness/ gains no logic; NodeRegistry, reverse RPC, and node_invoke live in cluster / gateway / builtin_tools respectively. |
