Skip to content

Hermetic Semgrep via Bazel

Author: Joe McGinley Status: Accepted Created: 2026-03-04 See also: rules_semgrep README


Problem

Agentic coding workflows (Claude Code, autonomous agents) need CI feedback in seconds, not minutes. Semgrep on managed CI infrastructure took 2m+ for diff scans and 5m+ for full scans — too slow for tight agent iteration loops.

Beyond speed, the standard Semgrep integration (pip install semgrep + registry rule fetches) breaks determinism. pip resolution is non-hermetic. Rule registry pulls vary across runs. The Python wrapper adds 2-4s startup overhead per invocation. Agents need identical results from identical inputs.

Bazel's content-addressed cache model is the answer: tests re-run only when their inputs change. But Semgrep has no native Bazel integration, and the engine ships as a Python wheel — not a standalone binary.

Proposal

Three-layer solution: (1) vendor semgrep-core OCaml binary as OCI artifacts on GHCR, bypassing the Python wrapper entirely; (2) Bazel rules that wire engine + rules + sources into cacheable sh_test targets; (3) a Gazelle extension that auto-generates scan targets from the dependency graph.

mermaid
graph TD
    subgraph "Daily Update · GitHub Actions"
        PYPI["PyPI Wheels"] --> EXTRACT["Extract semgrep-core binary"]
        API["Semgrep API"] --> PRO["Pro Engine + Rule Packs"]
        EXTRACT --> GHCR["GHCR · digest-pinned OCI artifacts"]
        PRO --> GHCR
    end

    subgraph "Bazel Analysis · bzlmod"
        GHCR -->|"sha256 digest"| OCI["oci_archive repository rule"]
        OCI --> ENGINE["Platform-specific engine binary"]
        ENGINE --> SELECT{"select()"}
        SELECT -->|"linux_x86_64"| E1["engine-amd64"]
        SELECT -->|"linux_aarch64"| E2["engine-arm64"]
        SELECT -->|"macos_x86_64"| E3["engine-osx-x86_64"]
        SELECT -->|"macos_aarch64"| E4["engine-osx-arm64"]
    end

    subgraph "Bazel Test Execution"
        E1 & E2 & E3 & E4 --> CORE["semgrep-core · direct invocation"]
        SRCS["Source files"] --> CORE
        RULES["Rule YAML"] --> CORE
        CORE -->|"cached until inputs change"| RESULT["Pass / Fail"]
        RESULT -.->|"best-effort"| APP["Semgrep App upload"]
    end

Key Decisions

DecisionRationale
Bypass Python wrapper, invoke semgrep-core directlyEliminates 2-4s Python startup per invocation
Vendor engine as OCI artifact (not pip)Content-addressed digest pinning; platform-specific binaries; no pip resolution
no-sandbox Bazel tagsemgrep-core needs real filesystem paths; sandbox adds ~100x overhead
Aspect for transitive source collectionWalks the real dependency graph for cross-file --pro analysis
Required credentials (GHCR_TOKEN + SEMGREP_APP_TOKEN)Missing credentials are build/test errors — Pro interfile analysis runs everywhere
Gazelle auto-generationZero-maintenance BUILD files; orphan detection ensures no coverage gaps
Per-rule-file execution with post-scan ID filteringFile-level exclusion is O(1); rule-ID exclusion handles granular suppressions

Results

MetricBefore (managed CI)After (Bazel + BuildBuddy)
Diff scan (cached)2m+30s
Full scan (new rules)5m+50s
Cold cache (all tests + images + semgrep)N/A4m
DeterminismNon-deterministic (registry fetches)Hermetic (digest-pinned)
Cache invalidationTime-based / noneContent-addressed (source + rule hash)

References

  • semgrep-core CLI — OCaml engine invoked directly, bypassing the Python pysemgrep wrapper
  • semgrep-core ATD target format — JSON ["Targets",[["CodeTarget",{...}]]] schema used for -targets flag; see semgrep-test.sh for generation
  • Semgrep Pro Enginesemgrep-core-proprietary binary enabling cross-file analysis via --pro flag
  • Semgrep App API (scan upload) — 3-step REST flow: register scan → upload findings → mark complete (used by upload.py)
  • GHCR OCI artifact layout — engine binaries stored as content-addressed OCI artifacts at ghcr.io/<org>/semgrep-*, pulled via oci_archive Bazel repository rules with SHA-256 digest pinning