Session Orchestration

Supervisor sessions that spawn, observe, and coordinate worker sessions in the same project.

A session can act as a supervisor that spawns, observes, messages, interrupts, and kills other sessions in the same project. The instance-level surface is enabled by default (unless disabled by operator config or MINIMAL_UI); each session that should run as a supervisor still opts in separately from the chat-view toolbar.

Different from pi-subagents. Subagents are LLM-internal child agents scoped to one tool call within a single session. Orchestrated workers are full first-class sessions — they show up in the session picker, have their own transcripts, can be opened directly in the browser, and survive after the supervisor ends.

Why turn it on

Two concrete workflows the feature is built for:

Supervisor watching peers. A lead session breaks down a goal, spawns one worker per discrete task, watches the worker inbox for turn-ends and ask-user-question requests, and coordinates the overall progress. Useful when work decomposes cleanly into parallel chunks (audit N files, test N components, port N modules).
Handoff pipeline. Session A finishes its piece of work, spawns session B with a contextSummary of what it learned, and detaches. Useful when work is sequential but the context window pressure on a single session would force premature compaction.

Topology

Strict hub-and-spoke at depth=1.

The supervisor is the hub. Every worker reports to it via the inbox; workers do not see each other.
A worker cannot become a supervisor (depth is capped at 1). This is enforced at enableSupervisor() and at registerWorker() — the store rejects with depth_limit_exceeded if you try.
Hub-and-spoke is enforced by the tool surface, not by a permission check: workers literally don't get the orchestrate_* tools, so there's no API for them to express worker→worker comms. There's nothing to enforce because there's nothing to express.
Same-project only in v1. Cross-project orchestration is intentionally out of scope — it would complicate the data model for a use case the design hasn't seen yet.

Availability — default-on, per-session opt-in

1. Instance disable switch (operator)

Session orchestration is available by default. To remove the instance-level surface entirely, set:

ORCHESTRATION_DISABLED=true

Or in docker/.env:

ORCHESTRATION_DISABLED=true

For backward compatibility, ORCHESTRATION_ENABLED=false also disables orchestration. ORCHESTRATION_ENABLED=true is no longer required; it is the default behavior. Availability makes the chat-view Orch toggle render. It does NOT enable supervisor mode for any session — that's the second gate.

Hard-disabled under MINIMAL_UI=true regardless of these flags. MINIMAL_UI deployments are locked-down by design; orchestration opens a tool surface the operator probably doesn't want end users touching.

2. Per-session toggle (user)

Open a session, click the Orch button in the chat-view toolbar, click Enable supervisor mode. The server rebuilds the live AgentSession in-place (same SessionManager, fresh customTools, SSE clients stay attached) so the orchestrate_* tools become available immediately — no reload, no reconnect.

Disable the same way to drop the tools. Disabling also detaches every linked worker (they survive as standalone sessions) and clears the supervisor's inbox.

Tool surface

The supervisor's agent gets eight new tools (all prefixed orchestrate_):

Tool	Purpose
`orchestrate_spawn_worker`	Create a new worker session with a task. `name` is required so the picker stays legible; optional `contextSummary` prepends handoff context to the initial prompt.
`orchestrate_list_workers`	Current state per worker (streaming / idle / cold) with message counts and last-activity timestamps.
`orchestrate_read_worker`	Fetch the worker's most recent messages, rendered as a readable transcript. Default `limit` is 1 (latest message only) — bump only when one-message context isn't enough.
`orchestrate_send_to_worker`	Inject a message into the worker's prompt stream. `mode` chooses between `prompt` (default, new turn or queue), `steer` (interrupt current turn — for course-correction), `followUp` (queue until idle — for next-task assignment).
`orchestrate_interrupt_worker`	Abort the worker's current turn. Session stays live.
`orchestrate_kill_worker`	Dispose the worker session. Optional `deleteOnDisk: true` also removes the `.jsonl`.
`orchestrate_detach_worker`	Drop the supervisor↔worker link; the worker continues as a standalone session.
`orchestrate_read_inbox`	Drain pending worker events (turn-ends, ask-user-question requests, retry failures, process alerts, deletions). Items return oldest-first and get marked delivered.

Workers do NOT get these tools — that's the topology enforcement.

Why prompts to workers should read as task briefs

The initialPrompt and send_to_worker.message parameters are described to the supervisor LLM as tasks, not conversation. Workers are autonomous agents with no shared transcript or memory — "can you help me look at the auth stuff" gives the worker nothing to act on. "Audit packages/server/src/auth.ts for the JWT verification path; report yes/no on whether expired tokens are rejected" is a usable brief.

Worker → supervisor wake-up

Worker events route into a per-supervisor inbox queue. Five event types:

Event	Source
`worker.ended`	Worker's AgentSession emitted `agent_end`
`worker.ask_user`	Worker called `ask_user_question`
`worker.auto_retry_failed`	Worker's SDK auto-retry exhausted on a provider error
`worker.process_alert`	Not enqueued for worker process alerts; process success/failure/kill notifications stay in the worker session and do not wake the supervisor
`worker.deleted`	Worker's session was deleted externally (cold or live); kills initiated through orchestration tools/UI controls update worker state but do not notify the supervisor about its own action.
`worker.detached`	Worker's supervisor link was detached by a human Web UI/API action; `orchestrate_detach_worker` remains a self-action and does not notify the supervisor.

When an event lands, the bridge enqueues it AND tries to wake the supervisor with a small [orchestration] N pending events… prompt. The wake-up only fires when:

The supervisor session is live (in the in-memory registry), AND
The supervisor is idle (not currently mid-turn), AND
There isn't already an in-flight wake-up for the current idle window (per-supervisor dedupe flag).

Orchestration-initiated kills are filtered before they reach the inbox, so the supervisor is not woken for its own worker-deletion tool/UI action.

If the supervisor is mid-turn, the items wait on the queue. Recovery fires another wake-up when the supervisor's own agent_end lands — closing the "supervisor finished a turn without calling orchestrate_read_inbox" gap.

The supervisor's LLM is expected to call orchestrate_read_inbox to drain the queue, then orchestrate_read_worker / orchestrate_send_to_worker to act on individual workers.

Safety

Fanout cap. Each supervisor can have at most ORCHESTRATION_MAX_WORKERS_PER_SUPERVISOR (default 8) live workers at a time. Killed / cold workers don't count. Bounded to [1, 100] so a typo in env doesn't disable the limit.
Ownership guard. Every tool that names a workerId verifies the worker is linked to this supervisor before acting. A confused supervisor LLM can't reach into another supervisor's workers by id-guessing — the worker store is the source of truth.
Depth cap. Workers cannot themselves become supervisors. No fork-bombs from a buggy prompt loop.
Same-project guard. spawn_worker always spawns into the supervisor's project; the parameter for cross-project doesn't exist in the tool schema.
Audit trail. Every orchestration tool call and inbox event writes a JSON-line entry to process.stderr (orchestration-inbox-enqueued, orchestration-wake-delivered, orchestration-wake-failed, etc.). No separate audit UI by design — operators tail container logs.

Storage

Two files in ${FORGE_DATA_DIR}:

File	Purpose	Mode
`session-orchestration.json`	Supervisor opt-in flags + supervisor↔worker links	`0600`
`orchestrator-inbox.json`	Per-supervisor pending event queue (FIFO-capped at 200 / supervisor)	`0600`

Same atomic-write + in-process-lock pattern as webhooks.md. The inbox file can carry assistant- message previews — anything the worker's agent has been talking about — so it gets the secret-grade 0600 even though it has no literal secrets.

REST surface

The agent-facing tools above are mirrored as REST endpoints the UI calls. All are gated by the instance disable switch and the MINIMAL_UI hard gate — they return 403 orchestration_disabled or 403 minimal_ui_disabled when off.

Method + path	Purpose
`GET /api/v1/orchestration/config`	Instance availability + caps. Used by the UI to decide whether to render the toggle.
`GET /api/v1/orchestration/sessions/:id`	Role of a session (`supervisor` / `worker` / `standalone`) + linkage.
`POST /api/v1/orchestration/sessions/:id/enable`	Make the session a supervisor; rebuild the live AgentSession in place.
`POST /api/v1/orchestration/sessions/:id/disable`	Drop supervisor mode; detach workers; clear inbox; rebuild.
`GET /api/v1/orchestration/sessions/:id/workers`	Live worker list for a supervisor (drives the Workers panel).
`GET /api/v1/orchestration/sessions/:id/inbox`	Full inbox history (delivered + pending), newest first.
`POST /api/v1/orchestration/sessions/:id/inbox/clear`	Wipe inbox.
`POST /api/v1/orchestration/sessions/:id/workers/:wid/detach`	Web UI/API detach. Notifies the supervisor before unlinking because a human detached the worker externally.
`POST /api/v1/orchestration/sessions/:id/workers/:wid/kill`	Programmatic orchestration kill that suppresses self-notification to the supervisor. The Web UI worker Kill button uses generic `DELETE /sessions/:id` instead so a human-initiated deletion notifies the supervisor.
`POST /api/v1/orchestration/sessions/:id/workers/:wid/resume`	Force-resume a cold worker into the registry.

Full schemas: open /api/docs in your deploy.

UI surface

Toolbar Orch button — renders when orchestration is available for the instance. Toggles the per-session Orchestration panel.
Orchestration panel (collapsible, mounted between the chat toolbar and the message list):
- Standalone session → "Enable supervisor mode" button.
- Supervisor session → workers list with state pills (streaming / idle / cold), per-worker Resume / Detach / Kill buttons, pending-inbox badge, expandable inbox history with type-coded labels, "Disable supervisor mode" button, "Clear inbox" button.
- Worker session → back-link to the supervisor (click to jump), handoff badge if spawned from a contextSummary.
The panel polls /orchestration/sessions/:id/workers and /inbox every 4s while open so the live state stays fresh.

Troubleshooting

Orch button isn't visible. Either ORCHESTRATION_DISABLED=true (or legacy ORCHESTRATION_ENABLED=false) is set, or MINIMAL_UI=true is overriding it. Check GET /api/v1/orchestration/config — disabledReason tells you which gate is closed.

Enabling supervisor mode succeeded but orchestrate_* tools aren't visible to the agent. The route rebuilds the live AgentSession in place, but only for sessions that are currently live in the registry. A cold session needs to be opened (SSE attach) first; opening it triggers resumeSession, which picks up the supervisor flag and includes the tools in customTools.

Supervisor doesn't react to worker activity. Two failure modes to distinguish:

The supervisor LLM isn't calling orchestrate_read_inbox. The inbox file is being written to; check ${FORGE_DATA_DIR}/orchestrator-inbox.json and look for delivered: false items. The wake-up prompt may have been ignored — strengthen the supervisor's system prompt or instruct it explicitly.
The wake-up isn't firing. Look for orchestration-wake-delivered / orchestration-wake-failed lines in stderr. A "failed" line usually means the supervisor session has no model configured — supervisors need provider credentials just like any other session.

fanout_limit_exceeded errors. Spawn was rejected because the supervisor already has the max live workers. Either kill some (orchestrate_kill_worker), detach the ones you no longer care about (orchestrate_detach_worker), or raise ORCHESTRATION_MAX_WORKERS_PER_SUPERVISOR.

depth_limit_exceeded. A worker is trying to become a supervisor. v1 disallows this by design — depth is capped at 1.