Hermetic Semgrep via Bazel
Author: Joe McGinley Status: Accepted Created: 2026-03-04 See also: rules_semgrep README
Problem
Agentic coding workflows (Claude Code, autonomous agents) need CI feedback in seconds, not minutes. Semgrep on managed CI infrastructure took 2m+ for diff scans and 5m+ for full scans — too slow for tight agent iteration loops.
Beyond speed, the standard Semgrep integration (pip install semgrep + registry rule fetches) breaks determinism. pip resolution is non-hermetic. Rule registry pulls vary across runs. The Python wrapper adds 2-4s startup overhead per invocation. Agents need identical results from identical inputs.
Bazel's content-addressed cache model is the answer: tests re-run only when their inputs change. But Semgrep has no native Bazel integration, and the engine ships as a Python wheel — not a standalone binary.
Proposal
Three-layer solution: (1) vendor semgrep-core OCaml binary as OCI artifacts on GHCR, bypassing the Python wrapper entirely; (2) Bazel rules that wire engine + rules + sources into cacheable sh_test targets; (3) a Gazelle extension that auto-generates scan targets from the dependency graph.
graph TD
subgraph "Daily Update · GitHub Actions"
PYPI["PyPI Wheels"] --> EXTRACT["Extract semgrep-core binary"]
API["Semgrep API"] --> PRO["Pro Engine + Rule Packs"]
EXTRACT --> GHCR["GHCR · digest-pinned OCI artifacts"]
PRO --> GHCR
end
subgraph "Bazel Analysis · bzlmod"
GHCR -->|"sha256 digest"| OCI["oci_archive repository rule"]
OCI --> ENGINE["Platform-specific engine binary"]
ENGINE --> SELECT{"select()"}
SELECT -->|"linux_x86_64"| E1["engine-amd64"]
SELECT -->|"linux_aarch64"| E2["engine-arm64"]
SELECT -->|"macos_x86_64"| E3["engine-osx-x86_64"]
SELECT -->|"macos_aarch64"| E4["engine-osx-arm64"]
end
subgraph "Bazel Test Execution"
E1 & E2 & E3 & E4 --> CORE["semgrep-core · direct invocation"]
SRCS["Source files"] --> CORE
RULES["Rule YAML"] --> CORE
CORE -->|"cached until inputs change"| RESULT["Pass / Fail"]
RESULT -.->|"best-effort"| APP["Semgrep App upload"]
endKey Decisions
| Decision | Rationale |
|---|---|
Bypass Python wrapper, invoke semgrep-core directly | Eliminates 2-4s Python startup per invocation |
| Vendor engine as OCI artifact (not pip) | Content-addressed digest pinning; platform-specific binaries; no pip resolution |
no-sandbox Bazel tag | semgrep-core needs real filesystem paths; sandbox adds ~100x overhead |
| Aspect for transitive source collection | Walks the real dependency graph for cross-file --pro analysis |
| Required credentials (GHCR_TOKEN + SEMGREP_APP_TOKEN) | Missing credentials are build/test errors — Pro interfile analysis runs everywhere |
| Gazelle auto-generation | Zero-maintenance BUILD files; orphan detection ensures no coverage gaps |
| Per-rule-file execution with post-scan ID filtering | File-level exclusion is O(1); rule-ID exclusion handles granular suppressions |
Results
| Metric | Before (managed CI) | After (Bazel + BuildBuddy) |
|---|---|---|
| Diff scan (cached) | 2m+ | 30s |
| Full scan (new rules) | 5m+ | 50s |
| Cold cache (all tests + images + semgrep) | N/A | 4m |
| Determinism | Non-deterministic (registry fetches) | Hermetic (digest-pinned) |
| Cache invalidation | Time-based / none | Content-addressed (source + rule hash) |
References
- semgrep-core CLI — OCaml engine invoked directly, bypassing the Python
pysemgrepwrapper - semgrep-core ATD target format — JSON
["Targets",[["CodeTarget",{...}]]]schema used for-targetsflag; seesemgrep-test.shfor generation - Semgrep Pro Engine —
semgrep-core-proprietarybinary enabling cross-file analysis via--proflag - Semgrep App API (scan upload) — 3-step REST flow: register scan → upload findings → mark complete (used by
upload.py) - GHCR OCI artifact layout — engine binaries stored as content-addressed OCI artifacts at
ghcr.io/<org>/semgrep-*, pulled viaoci_archiveBazel repository rules with SHA-256 digest pinning