Agent Platform
This document describes the agent infrastructure end-to-end: how agent sandboxes are provisioned, how the orchestrator manages job lifecycles, and how Claude chat connects through to a running agent pod.
Component Map
Claude.ai / Claude Code (external)
│
│ HTTPS mcp.jomcgi.dev
▼
┌────────────────────────────────────────────────────────────────────────────────────────────┐
│ MCP OAuth Proxy (prod / mcp-gateway namespace) │
│ projects/agent_platform/mcp_oauth_proxy │
│ OAuth 2.1 AS — Google OIDC — injects X-Forwarded-User │
└──────────────────────────────────────────┬─────────────────────────────────────────────────┘
│ proxies to ClusterIP :8000
▼
┌────────────────────────────────────────────────────────────────────────────────────────────┐
│ Context Forge (prod / mcp-gateway namespace) │
│ projects/agent_platform/context_forge · IBM mcp-context-forge v1.0.0-RC1 │
│ MCP gateway — aggregates tool servers, RBAC by team │
│ Backends: Postgres (state) + Redis (sessions) │
└───────┬──────────────────────────────────┬───────────────────────┬─────────────────────────┘
│ │ │
▼ ▼ ▼
signoz-mcp buildbuddy-mcp agent-orchestrator-mcp
argocd-mcp kubernetes-mcp todo-mcp
(projects/agent_platform/mcp_servers_chart — one pod per server, registered at startup)
│
│ HTTP ClusterIP :8080
▼
┌────────────────────────────────────────────────────────────────────────────────────────────┐
│ Agent Orchestrator (prod / agent-orchestrator namespace) │
│ projects/agent_platform/orchestrator · projects/agent_platform/orchestrator/deploy │
│ Go service — HTTP API + NATS JetStream consumer │
└──────────┬─────────────────────────────────────────────────────────────────────────────────┘
│ SandboxClaim CRUD + pod/exec
▼
┌────────────────────────────────────────────────────────────────────────────────────────────┐
│ Agent Sandbox Controller (cluster-critical) │
│ projects/platform/agent-sandbox · registry.k8s.io/agent-sandbox v0.1.1 │
│ CRDs: Sandbox · SandboxTemplate · SandboxClaim · SandboxWarmPool │
└──────────┬─────────────────────────────────────────────────────────────────────────────────┘
│ allocates pod from warm pool / creates pod
▼
┌────────────────────────────────────────────────────────────────────────────────────────────┐
│ Goose Sandbox Pod (prod / goose-sandboxes namespace) │
│ projects/agent_platform/sandboxes + projects/agent_platform/goose_agent (apko image) │
│ Runs: goose run --text <task> │
│ Tools: developer (builtin) · context-forge (MCP) · github │
│ LLM: Claude Max via LiteLLM proxy (claude-code provider) │
└────────────────────────────────────────────────────────────────────────────────────────────┘

1. Agent Provisioning
Controller: projects/platform/agent-sandbox — runs in agent-sandbox-system (cluster-critical)
The kubernetes-sigs/agent-sandbox controller (SIG Apps, v0.1.1) manages isolated agent pod lifecycle via purpose-built CRDs. It fills the gap between Deployments and StatefulSets with a single stateful pod abstraction.
CRDs
| CRD | Purpose |
|---|---|
| Sandbox (agents.x-k8s.io/v1alpha1) | Single-pod workload with PVC, headless Service, auto-delete lifecycle |
| SandboxTemplate (agents.x-k8s.io/v1alpha1) | Reusable pod spec — image, env, resources defined once |
| SandboxClaim (extensions.agents.x-k8s.io/v1alpha1) | Per-job request that claims a sandbox from a pool or creates one |
| SandboxWarmPool (agents.x-k8s.io/v1alpha1) | Pre-warmed pods for near-instant allocation |
Goose Sandboxes
Chart: projects/agent_platform/sandboxes/ — deployed to goose-sandboxes namespace
Installs:
- SandboxTemplate named goose-agent (references the apko-built image)
- SandboxWarmPool named goose-pool (size: 1)
- LimitRange — 1–4 CPU, 2–8Gi memory per pod
- ResourceQuota — max 5 pods, 8 CPU, 16Gi across namespace
- 1Password secrets: Claude OAuth token, GitHub PAT + BuildBuddy key, per-profile MCP tokens
Goose Agent Image
Built with: apko + rules_apko (projects/agent_platform/goose_agent/image/apko.yaml)
Registry: ghcr.io/jomcgi/homelab/projects/agent_platform/goose_agent/image
Architectures: x86_64 + aarch64 · User: uid/gid 65532
Wolfi packages baked in:
| Package | Purpose |
|---|---|
| goose | Agent framework — entrypoint |
| go | Build/test Go services |
| nodejs + pnpm | Build frontend apps |
| git + gh | Clone repos, push branches, open PRs |
| bash, coreutils, busybox | Shell tooling for recipe scripts |
| ca-certificates-bundle | TLS for outbound HTTPS |
Goose extensions baked into the image (~/.config/goose/config.yaml):
| Extension | Type | Endpoint |
|---|---|---|
| developer | builtin | Filesystem, shell, text editor (scoped to /workspace) |
| context-forge | streamable_http | http://context-forge.mcp-gateway.svc.cluster.local:8000/mcp |
| github | stdio | pnpm dlx @modelcontextprotocol/server-github (uses GITHUB_TOKEN) |
Agent Profiles
Profiles narrow tool access for specific task types. Each maps to a Goose recipe YAML and a scoped Context Forge token (stored in goose-mcp-tokens secret).
| Profile | Tools | Use case |
|---|---|---|
| (none) | All extensions | General coding tasks |
| ci-debug | buildbuddy-mcp only | CI failure investigation |
| code-fix | No cluster tools | Pure code changes, no observability access |
Profile definitions are documented in projects/agent_platform/sandboxes/profiles.yaml. Recipes live in projects/agent_platform/goose_agent/image/recipes/.
Long-Lived Agents
projects/agent_platform/sandboxes also supports persistent agents as Kubernetes Deployments. Each entry under agents: in values.yaml generates a ConfigMap (prompt) + Deployment (Goose runner). A checksum/prompt annotation on the pod template triggers rollouts when the prompt changes.
# projects/agent_platform/goose-sandboxes/deploy/values.yaml
agents:
  ci-watcher:
    enabled: true
    prompt: |
      Monitor open PRs for CI failures and fix them...

2. Lean Agent Toolchain
All container images are built remotely and hermetically via BuildBuddy RBE — never locally.
projects/agent_platform/goose_agent/image/
├── apko.yaml # Wolfi packages, uid 65532, dual-arch declaration
├── apko.lock.json # Pinned package SHAs — hermetic builds
├── config.yaml # Goose extensions baked into image
└── recipes/ # Goose recipe YAML per profile
    ├── ci-debug.yaml
    └── code-fix.yaml

Image pipeline for goose-agent:
bazel run //projects/agent_platform/goose_agent/image:push
│
├─ BuildBuddy RBE builds apko image (rules_apko)
├─ Hermetic: all deps from apko.lock.json SHAs
├─ Output: dual-arch OCI image
└─ Push: ghcr.io/jomcgi/homelab/projects/agent_platform/goose_agent/image:<tag>
│
└─ ArgoCD Image Updater detects new digest
└─ Writes back to projects/agent_platform/goose-sandboxes/deploy/values.yaml

The agent-orchestrator Go binary follows the same pattern:
projects/agent_platform/orchestrator/ -> go_binary -> go_image (apko base)
  -> ghcr.io/jomcgi/homelab/projects/agent_platform/orchestrator

No Dockerfiles. All images: apko-based, dual-arch, non-root (uid 65532), capabilities.drop: [ALL].
3. Agent Orchestrator
Source: projects/agent_platform/orchestrator/ (Go)
Chart: projects/agent_platform/orchestrator/deploy/
Deploy: projects/agent_platform/agent-orchestrator/deploy/
In-cluster: http://agent-orchestrator.agent-orchestrator.svc.cluster.local:8080 (ClusterIP only)
A single Go binary combining an HTTP API and a NATS JetStream consumer. Accepts job submissions, queues them durably, and executes them in isolated Goose sandbox pods.
Architecture
┌────────────────────────────────────────────────────────┐
│ agent-orchestrator │
│ │
│ HTTP :8080 │
│ ┌──────────┐ NATS JetStream │
│ │ REST API ├──▶ stream: agent-jobs │
│ │ │ subject: agent.jobs │
│ │ │ WorkQueue · max 1000 msgs │
│ └────┬─────┘ │ │
│ │ │ pull (MaxAckPending=3) │
│ │ ▼ │
│ │ ┌────────────────────┐ │
│ │ │ Consumer goroutine │ │
│ │ │ (up to 3 concurrent)│ │
│ ▼ └─────────┬──────────┘ │
│ ┌──────────┐ │ │
│ │ NATS KV │◀───────┘ │
│ │ bucket: │ job records (TTL 7 days) │
│ │job-records│ │
│ └──────────┘ │
└────────────────────────────────────────────────────────┘

Both the JetStream stream (agent-jobs) and KV bucket (job-records) are self-provisioned on startup via idempotent CreateOrUpdate calls — no manual NATS setup required.
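A minimal sketch of that startup provisioning with the nats.go jetstream package; the helper name is illustrative, and only the config values listed above are taken from the orchestrator.

import (
	"context"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

// provisionJetStream converges the stream and KV bucket on startup.
// CreateOrUpdate* calls are idempotent, so repeated startups are safe.
func provisionJetStream(ctx context.Context, nc *nats.Conn) (jetstream.KeyValue, error) {
	js, err := jetstream.New(nc)
	if err != nil {
		return nil, err
	}
	if _, err := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
		Name:      "agent-jobs",
		Subjects:  []string{"agent.jobs"},
		Retention: jetstream.WorkQueuePolicy,
		MaxMsgs:   1000,
	}); err != nil {
		return nil, err
	}
	return js.CreateOrUpdateKeyValue(ctx, jetstream.KeyValueConfig{
		Bucket: "job-records",
		TTL:    7 * 24 * time.Hour, // job records expire after 7 days
	})
}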
Job Lifecycle
POST /jobs
│
▼
PENDING ──▶ RUNNING ──▶ SUCCEEDED
└──▶ FAILED (retries exhausted)
└──▶ CANCELLED (via API or KV flag)
└──▶ PENDING (retry — message NAK'd for re-delivery)

State is persisted in NATS KV, keyed by ULID (lexicographically sortable = free chronological ordering).
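A sketch of how a new record might be written; field names follow the Data Model section below, while the helper and the assumption that JobStatus has a string underlying type are illustrative.

import (
	"context"
	"encoding/json"
	"time"

	"github.com/nats-io/nats.go/jetstream"
	"github.com/oklog/ulid/v2"
)

// createJob persists a PENDING record keyed by a ULID. ULIDs encode the
// timestamp in their prefix, so lexicographic key order is chronological order.
func createJob(ctx context.Context, kv jetstream.KeyValue, task, profile string) (string, error) {
	id := ulid.Make().String()
	rec := JobRecord{
		ID:        id,
		Task:      task,
		Profile:   profile,
		Status:    "PENDING",
		CreatedAt: time.Now().UTC(),
		UpdatedAt: time.Now().UTC(),
	}
	data, err := json.Marshal(rec)
	if err != nil {
		return "", err
	}
	_, err = kv.Put(ctx, id, data)
	return id, err
}

The orchestrator then publishes the ULID to agent.jobs, as shown in the Event Flow section.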
Cancellation is cooperative: the consumer polls KV status before each lifecycle phase. Setting status: CANCELLED in KV is sufficient — no separate signal channel.
Retry with context inheritance: On failure with retries remaining, the next attempt's prompt is enriched with the previous exit code and last 2,000 chars of output, helping the agent avoid the same failure mode.
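A minimal sketch of that enrichment; the function name and exact wording are assumptions.

import "fmt"

// retryPrompt appends the previous attempt's exit code and output tail to the task.
func retryPrompt(task string, exitCode int, output string) string {
	const tail = 2000
	if len(output) > tail {
		output = output[len(output)-tail:]
	}
	return fmt.Sprintf("%s\n\nPrevious attempt exited %d. Last output:\n%s", task, exitCode, output)
}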
Inactivity watchdog: Output streams through a syncBuffer. If no bytes arrive within 10 minutes (configurable), the execution context is cancelled — prevents hung Goose sessions from blocking the queue.
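A sketch of the watchdog idea: a syncBuffer that timestamps writes, plus a goroutine that cancels the exec context when output goes quiet. Types and names are illustrative, not the orchestrator's actual code.

import (
	"bytes"
	"context"
	"sync"
	"time"
)

// syncBuffer collects output and remembers when the last byte arrived.
type syncBuffer struct {
	mu        sync.Mutex
	buf       bytes.Buffer
	lastWrite time.Time
}

func (b *syncBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.lastWrite = time.Now()
	return b.buf.Write(p)
}

func (b *syncBuffer) idleSince() time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Since(b.lastWrite)
}

// watchInactivity cancels the exec context if no output arrives within timeout.
func watchInactivity(ctx context.Context, cancel context.CancelFunc, b *syncBuffer, timeout time.Duration) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if b.idleSince() > timeout {
				cancel() // hung Goose session: abort the exec, mark the attempt FAILED
				return
			}
		}
	}
}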
Consumer: Sandbox Execution Steps
1. Pull job ID from JetStream
2. Read JobRecord from NATS KV
3. If CANCELLED -> ACK and skip
4. Create SandboxClaim:
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxClaim
spec.sandboxTemplateRef.name: "goose-agent"
spec.lifecycle.shutdownPolicy: "Delete"
5. Poll SandboxClaim.status.sandbox until name appears
6. Resolve pod name from Sandbox.annotations["agents.x-k8s.io/pod-name"]
7. Wait for goose container Ready
8. Exec (refresh): git -C /workspace/homelab pull --ff-only origin main
9. Exec (run): goose run --text <task>
(profile): goose run --recipe <path> --no-profile --params task_description=<task>
10. Capture stdout+stderr -> syncBuffer (last 32KB)
11. Flush output to KV every 30s (live progress visible via GET /jobs/{id}/output)
12. On exit: KV -> SUCCEEDED | FAILED | CANCELLED
13. Delete SandboxClaim -> controller cleans up pod

Step 8 ensures agents always work from the latest main.
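Steps 8–10 amount to a pod exec that streams into the output buffer. A minimal sketch using client-go's remotecommand package; the container name comes from step 7, while the helper name and simplified error handling are illustrative.

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execInPod runs a command in the goose container and streams combined
// stdout+stderr into out.
func execInPod(ctx context.Context, cfg *rest.Config, cs *kubernetes.Clientset,
	namespace, pod string, out io.Writer, command ...string) error {

	req := cs.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(namespace).Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "goose",
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return err
	}
	return exec.StreamWithContext(ctx, remotecommand.StreamOptions{Stdout: out, Stderr: out})
}

Step 8 would pass the git pull command in the goose-sandboxes namespace; step 9 swaps in the goose run invocation, writing into the syncBuffer described above.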
REST API
| Method | Path | Description |
|---|---|---|
| POST | /jobs | Submit job -> 202 Accepted |
| GET | /jobs | List jobs (?status=RUNNING,PENDING, ?limit=, ?offset=) |
| GET | /jobs/{id} | Job detail + all attempt records |
| POST | /jobs/{id}/cancel | Cancel PENDING or RUNNING job |
| GET | /jobs/{id}/output | Latest attempt output (last 32KB) |
| GET | /health | Liveness / readiness |
Submit example:
// POST /jobs
{ "task": "Fix the flaky test in services/grimoire", "profile": "ci-debug", "max_retries": 2 }
// 202 Accepted
{ "id": "01JQXK5P...", "status": "PENDING", "created_at": "2026-03-08T..." }Data Model
type JobRecord struct {
ID string // ULID — lexicographically sorted by time
Task string
Profile string // "", "ci-debug", "code-fix"
Status JobStatus // PENDING | RUNNING | SUCCEEDED | FAILED | CANCELLED
CreatedAt time.Time
UpdatedAt time.Time
MaxRetries int // default: 2, max: 10
Source string // "api" | "github" | "cli"
Attempts []Attempt
}
type Attempt struct {
Number int // 1-based
SandboxClaimName string // "orch-<ulid>-<attempt>"
ExitCode *int
Output string // last 32KB of goose stdout+stderr
Truncated bool
StartedAt time.Time
FinishedAt *time.Time
}

RBAC
The orchestrator's ServiceAccount has the minimum permissions needed to drive sandbox lifecycle:
| Resource | Verbs |
|---|---|
| extensions.agents.x-k8s.io/sandboxclaims | create, get, list, watch, delete |
| agents.x-k8s.io/sandboxes | get, list, watch |
| core/pods | get, list, watch |
| core/pods/exec | create |
4. Claude Chat -> Agent Orchestrator MCP
Source: projects/agent_platform/orchestrator/mcp/ (Python, FastMCP + httpx)
Transport: STREAMABLEHTTP
Deployed via: projects/agent_platform/mcp_servers_chart/ (entry in projects/agent_platform/mcp-servers/deploy/values.yaml)
In-cluster: http://agent-orchestrator-mcp.mcp-servers.svc.cluster.local:8000
A thin FastMCP wrapper around the orchestrator REST API. Registered with Context Forge at deploy time by projects/agent_platform/mcp_servers_chart/templates/registration-job.yaml.
MCP Tools
| Tool | Wraps | Description |
|---|---|---|
| submit_job | POST /jobs | Queue a task for agent execution |
| list_jobs | GET /jobs | List jobs with status filter and pagination |
| get_job | GET /jobs/{id} | Full job record with attempt history |
| cancel_job | POST /jobs/{id}/cancel | Cancel a pending or running job |
| get_job_output | GET /jobs/{id}/output | Latest attempt output (last 32KB) |
Full Request Path: Claude Chat -> Running Agent
sequenceDiagram
actor Claude as Claude.ai / Claude Code
participant Proxy as MCP OAuth Proxy
participant CF as Context Forge
participant MCP as agent-orchestrator-mcp
participant Orch as agent-orchestrator
participant NATS as NATS JetStream
participant Ctrl as agent-sandbox controller
participant Pod as Goose sandbox pod
Claude->>Proxy: MCP: submit_job(task="Fix CI")
Proxy->>Proxy: Validate OAuth JWT (Google OIDC)
Proxy->>CF: forward + X-Forwarded-User
CF->>CF: RBAC check (team scope)
CF->>MCP: route to agent-orchestrator-mcp
MCP->>Orch: POST /jobs {task, profile, max_retries}
Orch->>NATS: KV PUT job-records/<ULID> {PENDING}
Orch->>NATS: JS PUB agent.jobs <ULID>
Orch-->>Claude: 202 {id: "01JQ...", status: "PENDING"}
Note over Orch,Pod: Consumer goroutine (async)
NATS-->>Orch: Pull job ID
Orch->>NATS: KV PUT {RUNNING}
Orch->>Ctrl: Create SandboxClaim "orch-01jq...-1"
Ctrl->>Pod: Allocate from warm pool
Orch->>Pod: exec: git pull --ff-only origin main
Orch->>Pod: exec: goose run --text "Fix CI"
Pod->>CF: MCP tool calls (SigNoz logs, ArgoCD status…)
Pod->>Pod: edit code · git commit · git push · gh pr create
Pod-->>Orch: exit 0
Orch->>NATS: KV PUT {SUCCEEDED}
Orch->>Ctrl: Delete SandboxClaim
Claude->>CF: MCP: get_job_output(id="01JQ...")
CF->>MCP: route
MCP->>Orch: GET /jobs/01JQ.../output
Orch->>NATS: KV GET job-records/01JQ...
Orch-->>Claude: {output: "PR #42 opened", exit_code: 0}

Polling: The MCP server is stateless. Claude must call get_job or get_job_output to poll for progress. Output is flushed to NATS KV every 30 seconds during execution, so intermediate results are visible before the job completes.
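The same polling pattern, sketched as a plain REST client against the orchestrator; the helper is illustrative, and field names follow the submit example above.

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// waitForJob polls GET /jobs/{id} until the job reaches a terminal status.
func waitForJob(ctx context.Context, base, id string) (string, error) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-ticker.C:
			resp, err := http.Get(base + "/jobs/" + id)
			if err != nil {
				return "", err
			}
			var job struct {
				Status string `json:"status"`
			}
			err = json.NewDecoder(resp.Body).Decode(&job)
			resp.Body.Close()
			if err != nil {
				return "", err
			}
			switch job.Status {
			case "SUCCEEDED", "FAILED", "CANCELLED":
				return job.Status, nil
			}
		}
	}
}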
5. Context Forge
Chart: projects/agent_platform/context_forge/ (wraps upstream IBM mcp-stack Helm chart)
Deploy: projects/agent_platform/context-forge/deploy/
Namespace: mcp-gateway
External: https://mcp.jomcgi.dev/mcp/ (Cloudflare tunnel -> MCP OAuth Proxy -> Context Forge)
In-cluster: http://context-forge.mcp-gateway.svc.cluster.local:8000/mcp
IBM's mcp-context-forge aggregates multiple upstream MCP servers behind a single endpoint with RBAC-based tool distribution.
Deployed Components
| Component | Purpose |
|---|---|
| Context Forge gateway | MCP protocol routing, tool registry, RBAC |
| Postgres | Durable state — tool registrations, teams, tokens |
| Redis | Session caching |
| Schema migration Job | Applied on upgrade, pinned to same image tag as gateway |
Auth Stack (External Access)
Claude.ai / Claude Code
│
│ HTTPS
▼
Cloudflare Tunnel (DDoS protection, TLS termination)
│
▼
MCP OAuth Proxy (obot-platform/mcp-oauth-proxy)
│ - RFC 9728 discovery + DCR (Dynamic Client Registration)
│ - Delegates identity to Google OIDC
│ - Issues its own short-lived JWTs to MCP clients
│ - Injects X-Forwarded-User: <google email>
▼
Context Forge (TRUST_PROXY_AUTH=true)
│ - Reads identity from X-Forwarded-User header
│ - Resolves team membership from identity
│ - Applies RBAC: tools.read + tools.execute for "developer" role
▼
MCP Server (e.g., agent-orchestrator-mcp)

In-cluster agents (Goose pods) reach Context Forge directly via ClusterIP at :8000 — no auth required.
RBAC Model
Context Forge uses two authorization layers (see ADR 005):
- Token scoping — the JWT teams claim controls which tools an agent can see
- Role — developer grants tools.read + tools.execute
| Client | Team | Visible tools |
|---|---|---|
| Claude Code / Claude.ai | infra-agents | All registered servers |
| Claude.ai web chat | web-chat | SigNoz read tools only |
| In-cluster Goose pods (ClusterIP) | — (bypass auth) | All tools |
Registered MCP Servers
All servers run in mcp-servers namespace. Registration happens once at deploy time via a Kubernetes Job that calls the Context Forge admin API.
| Server | Image | Transport | Tools |
|---|---|---|---|
| signoz-mcp | docker.io/signoz/signoz-mcp-server | STREAMABLEHTTP | Logs, traces, metrics, alerts, dashboards |
| buildbuddy-mcp | homelab Go service | STREAMABLEHTTP | CI invocations, build logs, targets |
| kubernetes-mcp | ghcr.io/containers/kubernetes-mcp-server | STREAMABLEHTTP | Pod list/logs/exec, resource reads |
| argocd-mcp | ghcr.io/argoproj-labs/mcp-for-argocd | STREAMABLEHTTP | App status, sync, history |
| todo-mcp | homelab Python service | STREAMABLEHTTP | Todo CRUD |
| agent-orchestrator-mcp | homelab Python service | STREAMABLEHTTP | Job submit/list/cancel/output |
All server definitions live in projects/agent_platform/mcp-servers/deploy/values.yaml. ArgoCD Image Updater maintains digest-pinned image tags automatically.
6. Agent Orchestrator Events
The orchestrator uses NATS JetStream as both job queue and state store.
NATS Resources
| Resource | Type | Config |
|---|---|---|
| agent-jobs stream | WorkQueue | subject: agent.jobs, max 1000 msgs |
| job-records KV bucket | KeyValue | keyed by ULID, TTL 7 days |
| orchestrator consumer | Durable pull | MaxAckPending=3, AckWait=JOB_MAX_DURATION+1m |
All three are self-provisioned on orchestrator startup. Single-node NATS at nats://nats.nats.svc.cluster.local:4222 (projects/platform/nats/).
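A sketch of how that durable consumer might be declared with the nats.go jetstream package; the durable name and the JOB_MAX_DURATION value here are assumptions.

import (
	"context"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// jobMaxDuration mirrors the JOB_MAX_DURATION setting; the value is an assumption.
const jobMaxDuration = 30 * time.Minute

func provisionConsumer(ctx context.Context, js jetstream.JetStream) (jetstream.Consumer, error) {
	return js.CreateOrUpdateConsumer(ctx, "agent-jobs", jetstream.ConsumerConfig{
		Durable:       "orchestrator",
		FilterSubject: "agent.jobs",
		AckPolicy:     jetstream.AckExplicitPolicy,
		MaxAckPending: 3,                            // at most 3 jobs in flight
		AckWait:       jobMaxDuration + time.Minute, // redeliver if not ACKed in time
	})
}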
Event Flow
POST /jobs ──▶ KV PUT job-records/<ULID> { status: PENDING }
──▶ JS PUB agent.jobs <ULID>
Consumer pull
├─▶ KV PUT { status: RUNNING, attempts: [{...}] }
├─▶ [every 30s] KV PUT { attempts[-1].output: <partial> }
└─▶ KV PUT { status: SUCCEEDED | FAILED | CANCELLED }
JS ACK (success / retries exhausted / cancelled)
JS NAK (retry — message redelivered)

State transitions are the events. Job status changes are immediately visible via GET /jobs/{id} or direct KV reads.
Consuming State Changes Externally
Any service with NATS access can watch the KV bucket for real-time job state changes:
// Watch all job-record changes (delta delivery — not full scans)
watcher, _ := kv.WatchAll(ctx)
for entry := range watcher.Updates() {
	if entry == nil {
		continue // nil marks the end of the initial replay
	}
	var job JobRecord
	json.Unmarshal(entry.Value(), &job)
	// react to job.Status transitions
}

This is the intended extension point for future webhook dispatch, DLQ handling, or GitHub issue creation on failure (see ADR 007).
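As a hedged illustration of that extension point, a hypothetical failure hook called from the watch loop above might look like the following; webhookURL and the helper are assumptions, not existing code.

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// notifyOnFailure posts the failed JobRecord to an external webhook.
func notifyOnFailure(webhookURL string, job JobRecord) error {
	if job.Status != "FAILED" {
		return nil
	}
	payload, err := json.Marshal(job)
	if err != nil {
		return err
	}
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}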
Related ADRs
| ADR | Decision |
|---|---|
| 001 - Background Agents | Initial motivation |
| 002 - OpenHands Agent Sandbox | Superseded approach |
| 003 - Context Forge | MCP gateway deployment |
| 004 - Autonomous Agents | Goose + agent-sandbox architecture |
| 005 - Role-Based MCP Access | Context Forge RBAC model |
| 006 - OIDC Auth MCP Gateway | OAuth proxy + Google OIDC |
| 007 - Agent Orchestrator | Orchestrator service design |
Quick Reference
# Explore the agent platform
ls projects/platform/agent-sandbox/ # Controller chart + CRDs
ls projects/agent_platform/sandboxes/ # SandboxTemplate, warm pool, namespace config
ls projects/agent_platform/goose_agent/image/ # apko spec, Goose config, recipes
ls projects/agent_platform/orchestrator/ # Go service source (api.go, consumer.go, sandbox.go)
ls projects/agent_platform/orchestrator/mcp/ # Python MCP wrapper
ls projects/agent_platform/orchestrator/deploy/ # Orchestrator Helm chart + ArgoCD Application
ls projects/agent_platform/context_forge/deploy/ # MCP gateway (wraps IBM mcp-stack)
ls projects/agent_platform/mcp_servers_chart/deploy/ # All MCP server pods + registration jobs
ls projects/agent_platform/sandboxes/deploy/ # Prod sandbox values + image tags
ls projects/agent_platform/mcp_servers/deploy/ # MCP servers deploy config