ADR 005: Repo Markdown Knowledge Graph Sync via OCI Volume
Author: jomcgi Status: Draft Created: 2026-04-11
Problem
The homelab repo contains markdown files spread across the entire tree — ADRs in docs/decisions/, design plans in docs/plans/, operational docs (security.md, observability.md), service-level READMEs, operator best practices (projects/operators/best-practices.md), CLAUDE.md files, and more. None of this is visible to the knowledge graph.
This means:
- Semantic search misses repo context — asking "what decisions have we made about auth?" only surfaces personal vault notes, not the ADR that drove the middleware rewrite
- No cross-linking — repo-derived knowledge can't form
derives_fromorrelatededges with personal knowledge atoms - Stale knowledge persists — when an ADR is superseded, a design plan is abandoned, or a service doc is removed, the knowledge graph has no mechanism to detect or remove outdated artifacts
- Scattered context — service-specific docs, operator guides, and project instructions live alongside code but aren't discoverable via the knowledge API
The gardener already decomposes raw vault notes into typed knowledge artifacts (atoms, facts, active items). The gap is getting all repo markdown into its input pipeline — and cleaning up when files are removed or superseded.
Proposal
Collect all markdown files in the repo into a lightweight OCI artifact at CI time via a top-level Bazel filegroup glob. Mount it as an init container volume in the monolith pod. The gardener treats the mount path as a second input directory alongside the vault, processing markdown into the knowledge graph with full decomposition, edge creation, and embedding.
A top-level filegroup in the root BUILD file (repo_markdown) globs all **/*.md files, excluding vendored/generated content. This single source of truth feeds both the OCI artifact build and any future tooling that needs repo-wide markdown access.
A manifest file (manifest.json) in the OCI artifact lists all included files with content hashes. On startup, the monolith compares the manifest against previously-processed repo docs and deletes _processed/ artifacts whose source files no longer exist — solving the stale knowledge problem.
| Aspect | Today | Proposed |
|---|---|---|
| Repo docs in KG | Not indexed | Decomposed into typed artifacts with edges |
| Update trigger | N/A | CI rebuild on push to main → Image Updater bumps digest |
| Deletion handling | N/A | Manifest diff removes orphaned artifacts |
| Runtime deps | N/A | None — docs baked into OCI artifact at build time |
| Gardener changes | Scans vault only | Scans vault + repo docs mount |
Architecture
Build Pipeline
graph LR
A[push to main] --> B[Bazel builds<br/>docs OCI artifact]
B --> C[Push to GHCR<br/>ghcr.io/jomcgi/homelab/repo-docs:latest]
C --> D[Image Updater<br/>bumps digest]
D --> E[Pod restart<br/>init container copies docs]
E --> F[Gardener processes<br/>new/changed docs]Bazel Filegroup
A top-level filegroup in the root BUILD file provides the canonical set of repo markdown:
filegroup(
name = "repo_markdown",
srcs = glob(
["**/*.md"],
exclude = [
"node_modules/**",
"bazel-*/**",
"**/testdata/**",
],
),
visibility = ["//visibility:public"],
)This feeds the OCI artifact build and can be reused by other targets (linting, search indexing, etc.).
OCI Artifact Contents
The artifact preserves the repo's directory structure so file paths in the knowledge graph match repo paths:
/repo-markdown/
├── manifest.json # {"files": {"docs/security.md": "sha256:...", ...}, "built_at": "..."}
├── docs/
│ ├── decisions/
│ │ ├── agents/001-background-agents.md
│ │ └── services/005-repo-docs-knowledge-sync.md
│ ├── plans/
│ │ └── ...
│ ├── observability.md
│ └── security.md
├── projects/
│ ├── monolith/CLAUDE.md
│ ├── operators/best-practices.md
│ └── ...
├── .claude/CLAUDE.md
└── CLAUDE.mdPod Integration
graph TB
subgraph "Monolith Pod"
IC[init: repo-markdown<br/>copies to emptyDir] -->|"/repo-markdown"| MO[monolith<br/>FastAPI + gardener]
MO -->|scans| VR["/vault<br/>(Obsidian notes)"]
MO -->|scans| RD["/repo-markdown<br/>(OCI volume)"]
end
MO -->|"_processed/ artifacts"| PG[(PostgreSQL<br/>embeddings + edges)]The init container uses the OCI artifact image and copies its contents to a shared emptyDir volume at /repo-markdown. The gardener scans this path as a second raw input source.
Deletion Flow
sequenceDiagram
participant IC as Init Container
participant MO as Monolith
participant PG as PostgreSQL
IC->>MO: Copy new docs + manifest.json to /repo-docs
MO->>PG: Load previous manifest from repo_doc_state table
MO->>MO: Diff current vs previous manifest
alt Files removed
MO->>PG: Delete AtomRawProvenance rows for removed files
MO->>PG: Delete _processed/ artifacts (cascade: chunks, edges)
Note over MO,PG: Reconciler picks up deletions on next cycle
end
MO->>PG: Store current manifest as new state
MO->>MO: Gardener processes new/changed files normallyChange Detection
The gardener's existing provenance system handles most of the lifecycle:
- New file: No provenance row → gardener decomposes it
- Unchanged file: Provenance row with current
GARDENER_VERSION→ skip - Changed file: Content hash in
RawInputchanges → gardener re-ingests (existing behavior via_ensure_raw_inputs) - Deleted file: Manifest diff detects removal → cleanup job deletes provenance + artifacts
- Gardener version bump: All repo docs re-processed (same as vault notes)
The only new logic is the manifest-based deletion — everything else rides on existing gardener infrastructure.
Implementation
Phase 1: Bazel Filegroup + OCI Artifact Build
- [ ] Add
repo_markdownfilegroup to rootBUILDfile (glob**/*.md, exclude vendored/generated) - [ ] Create
bazel/images/repo-markdown/with an apko config (minimal busybox image +/repo-markdowndirectory) - [ ] Add Bazel rule to copy
repo_markdownfilegroup into the image preserving directory structure - [ ] Add genrule to generate
manifest.jsonat build time with file paths and content hashes - [ ] Add
oci_pushtarget pushing toghcr.io/jomcgi/homelab/repo-markdown - [ ] Wire into CI (
buildbuddy.yaml) to build + push on main branch
Phase 2: Monolith Integration
- [ ] Add init container to monolith deployment template using the repo-markdown image
- [ ] Add
emptyDirvolume mounted at/repo-markdownshared between init container and monolith - [ ] Add
REPO_MARKDOWN_ROOTenv var to monolith container (default:/repo-markdown) - [ ] Add ArgoCD Image Updater annotation for the repo-markdown image (digest tracking)
Phase 3: Gardener Multi-Source
- [ ] Extend
_ensure_raw_inputsingardener.pyto scanREPO_MARKDOWN_ROOTin addition to vault - [ ] Tag repo-doc
RawInputrows withsource='repo'to distinguish from vault notes - [ ] Add
repo_doc_statetable (or reuse existing config table) to store the previous manifest - [ ] Implement manifest diff on startup: compare current
manifest.jsonvs stored state - [ ] For removed files: delete
AtomRawProvenancerows and corresponding_processed/artifacts - [ ] Store updated manifest after successful diff + cleanup
- [ ] Gardener prompt: include
source: repocontext so decomposition can reference the repo path in edges
Phase 4: Observability
- [ ] Add SigNoz metric:
knowledge.repo_docs.processed(counter of repo docs ingested per cycle) - [ ] Add SigNoz metric:
knowledge.repo_docs.deleted(counter of orphaned artifacts cleaned up) - [ ] Log manifest diff results at INFO level on each startup
Security
Baseline per docs/security.md. No deviations.
The OCI artifact contains only markdown files already public in the GitHub repo. No secrets, no credentials. The init container runs as uid 65532 (non-root) and only copies files to an emptyDir.
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gardener decomposes ADRs poorly (too long/structured for atomic decomposition) | Medium | Low — bad atoms, not data loss | Tune gardener prompt for repo doc structure; add source: repo hint so model adapts |
| Manifest diff deletes artifacts for a temporarily-missing file (build race) | Low | Medium — knowledge gap until re-processed | Manifest is generated deterministically at build time; only diffs against stored state, not filesystem |
| Init container blocks pod startup if image pull fails | Low | Medium — monolith doesn't start | Use imagePullPolicy: IfNotPresent; repo-docs absence shouldn't block startup (make mount optional) |
| Large number of docs overwhelms gardener cycle | Low | Low — gardener already rate-limits | Gardener processes N raws per cycle with existing budget cap; repo docs queued like any other raw |
Open Questions
- Superseded ADR handling — when an ADR status changes to "Superseded by NNN", should the gardener emit a
supersedesedge and mark old artifacts, or just let the natural decomposition handle it? - Exclude patterns — beyond
node_modules/,bazel-*/, andtestdata/, are there other markdown files that should be excluded from the filegroup (e.g., vendored dependency docs, generated changelogs)?
References
| Resource | Relevance |
|---|---|
projects/monolith/knowledge/gardener.py | Existing gardener that will scan repo docs |
projects/monolith/knowledge/reconciler.py | Hash-based reconciler that handles artifact lifecycle |
projects/monolith/knowledge/models.py | AtomRawProvenance model for provenance tracking |
docs/decisions/agents/012-knowledge-gardener-model-pipeline.md | Gardener's two-tier model architecture |
docs/decisions/agents/013-knowledge-gardener-gemma4-only.md | Current gardener model (supersedes 012) |
docs/plans/2026-04-09-vault-git-sync.md | Vault git clone pattern (related but separate) |
| apko | Image build tool used for OCI artifact |
| ArgoCD Image Updater | Digest-based auto-update for the docs image |