Session Orchestration

Supervisor sessions that spawn, observe, and coordinate worker sessions in the same project.

A session can act as a supervisor that spawns, observes, messages, interrupts, and kills other sessions in the same project. The instance-level surface is enabled by default (unless disabled by operator config or MINIMAL_UI); each session that should run as a supervisor still opts in separately from the chat-view toolbar.

Different from pi-subagents. Subagents are LLM-internal child agents scoped to one tool call within a single session. Orchestrated workers are full first-class sessions — they show up in the session picker, have their own transcripts, can be opened directly in the browser, and survive after the supervisor ends.

Why turn it on

Two concrete workflows the feature is built for:

Topology

Strict hub-and-spoke at depth=1.

Availability — default-on, per-session opt-in

1. Instance disable switch (operator)

Session orchestration is available by default. To remove the instance-level surface entirely, set:

ORCHESTRATION_DISABLED=true

Or in docker/.env:

ORCHESTRATION_DISABLED=true

For backward compatibility, ORCHESTRATION_ENABLED=false also disables orchestration. ORCHESTRATION_ENABLED=true is no longer required; it is the default behavior. Availability makes the chat-view Orch toggle render. It does NOT enable supervisor mode for any session — that's the second gate.

Hard-disabled under MINIMAL_UI=true regardless of these flags. MINIMAL_UI deployments are locked-down by design; orchestration opens a tool surface the operator probably doesn't want end users touching.

2. Per-session toggle (user)

Open a session, click the Orch button in the chat-view toolbar, click Enable supervisor mode. The server rebuilds the live AgentSession in-place (same SessionManager, fresh customTools, SSE clients stay attached) so the orchestrate_* tools become available immediately — no reload, no reconnect.

Disable the same way to drop the tools. Disabling also detaches every linked worker (they survive as standalone sessions) and clears the supervisor's inbox.

Tool surface

The supervisor's agent gets eight new tools (all prefixed orchestrate_):

Tool Purpose
orchestrate_spawn_worker Create a new worker session with a task. name is required so the picker stays legible; optional contextSummary prepends handoff context to the initial prompt.
orchestrate_list_workers Current state per worker (streaming / idle / cold) with message counts and last-activity timestamps.
orchestrate_read_worker Fetch the worker's most recent messages, rendered as a readable transcript. Default limit is 1 (latest message only) — bump only when one-message context isn't enough.
orchestrate_send_to_worker Inject a message into the worker's prompt stream. mode chooses between prompt (default, new turn or queue), steer (interrupt current turn — for course-correction), followUp (queue until idle — for next-task assignment).
orchestrate_interrupt_worker Abort the worker's current turn. Session stays live.
orchestrate_kill_worker Dispose the worker session. Optional deleteOnDisk: true also removes the .jsonl.
orchestrate_detach_worker Drop the supervisor↔worker link; the worker continues as a standalone session.
orchestrate_read_inbox Drain pending worker events (turn-ends, ask-user-question requests, retry failures, process alerts, deletions). Items return oldest-first and get marked delivered.

Workers do NOT get these tools — that's the topology enforcement.

Why prompts to workers should read as task briefs

The initialPrompt and send_to_worker.message parameters are described to the supervisor LLM as tasks, not conversation. Workers are autonomous agents with no shared transcript or memory — "can you help me look at the auth stuff" gives the worker nothing to act on. "Audit packages/server/src/auth.ts for the JWT verification path; report yes/no on whether expired tokens are rejected" is a usable brief.

Worker → supervisor wake-up

Worker events route into a per-supervisor inbox queue. Five event types:

Event Source
worker.ended Worker's AgentSession emitted agent_end
worker.ask_user Worker called ask_user_question
worker.auto_retry_failed Worker's SDK auto-retry exhausted on a provider error
worker.process_alert Not enqueued for worker process alerts; process success/failure/kill notifications stay in the worker session and do not wake the supervisor
worker.deleted Worker's session was deleted externally (cold or live); kills initiated through orchestration tools/UI controls update worker state but do not notify the supervisor about its own action.
worker.detached Worker's supervisor link was detached by a human Web UI/API action; orchestrate_detach_worker remains a self-action and does not notify the supervisor.

When an event lands, the bridge enqueues it AND tries to wake the supervisor with a small [orchestration] N pending events… prompt. The wake-up only fires when:

  1. The supervisor session is live (in the in-memory registry), AND
  2. The supervisor is idle (not currently mid-turn), AND
  3. There isn't already an in-flight wake-up for the current idle window (per-supervisor dedupe flag).

Orchestration-initiated kills are filtered before they reach the inbox, so the supervisor is not woken for its own worker-deletion tool/UI action.

If the supervisor is mid-turn, the items wait on the queue. Recovery fires another wake-up when the supervisor's own agent_end lands — closing the "supervisor finished a turn without calling orchestrate_read_inbox" gap.

The supervisor's LLM is expected to call orchestrate_read_inbox to drain the queue, then orchestrate_read_worker / orchestrate_send_to_worker to act on individual workers.

Safety

Storage

Two files in ${FORGE_DATA_DIR}:

File Purpose Mode
session-orchestration.json Supervisor opt-in flags + supervisor↔worker links 0600
orchestrator-inbox.json Per-supervisor pending event queue (FIFO-capped at 200 / supervisor) 0600

Same atomic-write + in-process-lock pattern as webhooks.md. The inbox file can carry assistant- message previews — anything the worker's agent has been talking about — so it gets the secret-grade 0600 even though it has no literal secrets.

REST surface

The agent-facing tools above are mirrored as REST endpoints the UI calls. All are gated by the instance disable switch and the MINIMAL_UI hard gate — they return 403 orchestration_disabled or 403 minimal_ui_disabled when off.

Method + path Purpose
GET /api/v1/orchestration/config Instance availability + caps. Used by the UI to decide whether to render the toggle.
GET /api/v1/orchestration/sessions/:id Role of a session (supervisor / worker / standalone) + linkage.
POST /api/v1/orchestration/sessions/:id/enable Make the session a supervisor; rebuild the live AgentSession in place.
POST /api/v1/orchestration/sessions/:id/disable Drop supervisor mode; detach workers; clear inbox; rebuild.
GET /api/v1/orchestration/sessions/:id/workers Live worker list for a supervisor (drives the Workers panel).
GET /api/v1/orchestration/sessions/:id/inbox Full inbox history (delivered + pending), newest first.
POST /api/v1/orchestration/sessions/:id/inbox/clear Wipe inbox.
POST /api/v1/orchestration/sessions/:id/workers/:wid/detach Web UI/API detach. Notifies the supervisor before unlinking because a human detached the worker externally.
POST /api/v1/orchestration/sessions/:id/workers/:wid/kill Programmatic orchestration kill that suppresses self-notification to the supervisor. The Web UI worker Kill button uses generic DELETE /sessions/:id instead so a human-initiated deletion notifies the supervisor.
POST /api/v1/orchestration/sessions/:id/workers/:wid/resume Force-resume a cold worker into the registry.

Full schemas: open /api/docs in your deploy.

UI surface

Troubleshooting

Orch button isn't visible. Either ORCHESTRATION_DISABLED=true (or legacy ORCHESTRATION_ENABLED=false) is set, or MINIMAL_UI=true is overriding it. Check GET /api/v1/orchestration/configdisabledReason tells you which gate is closed.

Enabling supervisor mode succeeded but orchestrate_* tools aren't visible to the agent. The route rebuilds the live AgentSession in place, but only for sessions that are currently live in the registry. A cold session needs to be opened (SSE attach) first; opening it triggers resumeSession, which picks up the supervisor flag and includes the tools in customTools.

Supervisor doesn't react to worker activity. Two failure modes to distinguish:

  1. The supervisor LLM isn't calling orchestrate_read_inbox. The inbox file is being written to; check ${FORGE_DATA_DIR}/orchestrator-inbox.json and look for delivered: false items. The wake-up prompt may have been ignored — strengthen the supervisor's system prompt or instruct it explicitly.
  2. The wake-up isn't firing. Look for orchestration-wake-delivered / orchestration-wake-failed lines in stderr. A "failed" line usually means the supervisor session has no model configured — supervisors need provider credentials just like any other session.

fanout_limit_exceeded errors. Spawn was rejected because the supervisor already has the max live workers. Either kill some (orchestrate_kill_worker), detach the ones you no longer care about (orchestrate_detach_worker), or raise ORCHESTRATION_MAX_WORKERS_PER_SUPERVISOR.

depth_limit_exceeded. A worker is trying to become a supervisor. v1 disallows this by design — depth is capped at 1.

See also