ADR 005: tOyCaml -- a representative demonstrator for the OCaml ruleset

Author: Joe McGinley
Status: Accepted
Created: 2026-06-10


Problem

ADR 004 commits to scaling bazel/ocaml toward building a real OCaml analysis engine, with the public Semgrep engine (github.com/semgrep/semgrep) as the reference target. The scaling plan's acceptance tests, however, are generic (examples/hello, examples/regex, examples/c_stubs). Generic examples do not exercise the build features a real engine actually needs, and they hide ordering constraints between those features (for instance, that a codegen tool must work before any tree-sitter binding compiles).

We want a target that is:

  1. Representative -- it mirrors how the public engine uses the OCaml ecosystem (a generic AST, lexing/parsing, pattern matching with metavariables, C stubs, a real opam dependency), so each ruleset capability is proven against something engine-like.
  2. Small and green -- it builds in seconds on today's ruleset and stays green as features land, so it can be the standing acceptance target.
  3. Reference-free in the toy itself -- the example code names no product; the mapping to the public engine lives here in the ADR.

The gap between "builds the toy" and "builds the engine" is a fixed list of build features, all visible in the public repo. tOyCaml exists to drive them out one at a time.

Decision

Add tOyCaml at bazel/ocaml/examples/toycaml: a deliberately tiny "grep for code" (parse a pattern and a target expression; structurally match the pattern, with metavariables, against the target). It is wired to mirror the public engine's ecosystem usage and grows feature-by-feature to cover the load-bearing build features that engine requires. JS/WASM is out of scope (see below).
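To make "structurally match the pattern, with metavariables, against the target" concrete, here is a minimal sketch of that matching step. The expr type, the constructor names, and match_expr are hypothetical illustrations, not the contents of tc_ast.ml or tc_matcher.ml; the only assumption carried over from the text is that a pattern is code in which some nodes are metavariables.

```ocaml
(* Hypothetical expression type; the real tc_ast.ml may differ. *)
type expr =
  | Int of int
  | Var of string
  | Metavar of string            (* e.g. $X; only meaningful in patterns *)
  | Call of string * expr list

(* Bindings from metavariable name to the sub-expression it matched. *)
type env = (string * expr) list

(* Structurally match [pat] against [target], threading bindings.
   A metavariable binds on first use and must match a structurally
   equal sub-expression on every later use. *)
let rec match_expr (env : env) (pat : expr) (target : expr) : env option =
  match pat, target with
  | Metavar name, _ ->
    (match List.assoc_opt name env with
     | None -> Some ((name, target) :: env)
     | Some bound -> if bound = target then Some env else None)
  | Int a, Int b -> if a = b then Some env else None
  | Var a, Var b -> if String.equal a b then Some env else None
  | Call (f, ps), Call (g, ts) ->
    if String.equal f g && List.length ps = List.length ts
    then match_list env ps ts
    else None
  | _ -> None

(* Match argument lists pairwise, left to right. *)
and match_list env ps ts =
  match ps, ts with
  | [], [] -> Some env
  | p :: ps', t :: ts' ->
    (match match_expr env p t with
     | Some env' -> match_list env' ps' ts'
     | None -> None)
  | _ -> None

let () =
  (* The pattern add($X, $X) matches add(1, 1) but not add(1, 2). *)
  let pat = Call ("add", [Metavar "X"; Metavar "X"]) in
  assert (match_expr [] pat (Call ("add", [Int 1; Int 1])) <> None);
  assert (match_expr [] pat (Call ("add", [Int 1; Int 2])) = None)
```

Returning an option of bindings rather than a bool keeps the sketch close to what a "grep for code" needs to report: not only whether the target matched, but what each metavariable captured.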

tOyCaml becomes the ruleset's representative acceptance target: as each new capability lands, the matching toy component is upgraded to use it, so the ruleset always has an engine-shaped, green target to build against. Phase 8 of the scaling plan (the real engine) then reduces to "tOyCaml, but with hundreds of opam packages and dozens of grammars" -- the same mechanisms at volume.

What it mirrors today (buildable on the current ruleset)

| tOyCaml component | Engine shape it stands in for | Ruleset feature exercised |
|---|---|---|
| tc_ast.ml/.mli | a generic AST node type | multi-module library + .mli |
| tc_pattern.ml/.mli | "a pattern is code with metavariables" | intra-library dep |
| tc_lexer.ml/.mli | the lexing stage | the fetched-from-source re opam dep |
| tc_parse.ml/.mli | the parsing stage | inter-module ordering (ocamldep -sort) |
| tc_matcher.ml/.mli | the matching engine core | inter-library dep on :toycaml_intern |
| tc_intern.ml/.mli + intern_stubs.c | the lone first-party C foreign-stub | c_srcs |
| main.ml | the CLI entry point | ocaml_binary + build_test |
| matcher_test.ml | engine tests | ocaml_test |

What it must grow to cover (the gap, all public)

Each item is a load-bearing build feature of the public engine, listed with where it can be observed in github.com/semgrep/semgrep. None is in the current ruleset; each becomes a tOyCaml component as it lands.

| Feature | Where it shows in the public engine | tOyCaml component to add |
|---|---|---|
| flambda compiler + -O3 | the root dune sets (ocamlopt_flags (:standard -O3)); semgrep.opam.locked pins ocaml-option-flambda | build the whole toy with a flambda toolchain; a test asserts ocamlopt -config reports flambda |
| codegen tool as a build action (atdgen) | many (rule (action (run atdgen ...))), including the tree-sitter OCaml runtime bindings | a small .atd generates a result type the matcher emits |
| visitors-style ppx + wider ppx set | visitors, ppx_deriving, ppx_sexp_conv, ppx_hash, ... in the dune tree | derive a traversal over the AST instead of hand-writing it |
| system-library vendoring | conf-gmp, conf-libpcre, conf-libev, conf-zstd, ... in the lockfile | a pcre regex rule; a gmp-backed bignum |
| non-dune opam package escape hatch | zarith (hand-rolled configure + make, not dune) is in the closure | build zarith via an override |
| static final link | manylinux/musl-static wheels; the link flags are computed per platform | statically link the toycaml binary |
| submodule-fetched grammars | ~30 tree-sitter grammars are git submodules | fetch one grammar + its parser.c and bind it |

Sequencing note this surfaces: atdgen is not an edge case -- it generates the tree-sitter OCaml runtime bindings, so it must work before the grammar/C-stub work, not after it. ADR 004's plan ordered it the other way; tOyCaml makes the dependency explicit.
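The flambda row above calls for a test that asserts ocamlopt -config reports flambda. A minimal sketch of the checking half of such a test follows. It relies on ocamlopt -config printing one "key: value" pair per line, including a flambda entry; the function names and the sample text are hypothetical, and how the real test obtains the compiler's output (for example, through a Bazel-provided file or a subprocess) is left open here.

```ocaml
(* Look up [key] in "key: value" lines, the format `ocamlopt -config` prints.
   Splits on the first ':' only, so values containing colons survive. *)
let config_value (key : string) (config : string) : string option =
  String.split_on_char '\n' config
  |> List.find_map (fun line ->
       match String.index_opt line ':' with
       | None -> None
       | Some i ->
         if String.equal (String.trim (String.sub line 0 i)) key
         then
           Some (String.trim
                   (String.sub line (i + 1) (String.length line - i - 1)))
         else None)

(* True only when the config explicitly says "flambda: true". *)
let is_flambda (config : string) : bool =
  config_value "flambda" config = Some "true"

(* Hypothetical excerpt standing in for real `ocamlopt -config` output. *)
let sample = "version: 5.1.0\nflambda: true\narchitecture: amd64\n"

let () =
  assert (is_flambda sample);
  assert (not (is_flambda "flambda: false\n"))
```

Treating a missing flambda line as a failure, rather than as "unknown", is a deliberate choice in this sketch: a toolchain that silently falls back to the non-flambda compiler should turn the acceptance target red.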

JS/WASM is out of scope

The engine's native path (ocamlopt) is the target. The public playground ships a js_of_ocaml/wasm bundle, but that is a separate compilation backend (bytecode via ocamlc, then js_of_ocaml/wasm_of_ocaml) that the native-only driver cannot produce. We explicitly do not pursue it here; if it is ever needed it is a new ADR, not a tOyCaml component.

Alternatives Considered

  • Keep the generic examples. Rejected: they do not exercise the real feature set and hide ordering constraints (the atdgen-before-grammars inversion above only became obvious by mapping a representative target).
  • Go straight to the real engine (Phase 8 first). Rejected: too large to land green incrementally, and it forces the public/private and submodule/volume problems before the mechanisms are proven on something small.
  • Vendor the public engine into this repo as the example. Rejected: heavy, slow, and unnecessary -- mirroring the shape proves the mechanisms; volume is Phase 8's job.

Security

Baseline per docs/security.md. tOyCaml is original first-party code plus the already-pinned re dependency; it introduces no new fetched artifacts and no product source.

Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| The toy drifts from the real engine's shape over time | Low | Low | This ADR's mapping table is the contract; revisit when Phase 8 starts |
| A feature is "demonstrated" on the toy but breaks at engine volume | Medium | Medium | The toy proves mechanism, not scale; Phase 8 keeps its own re-plan checkpoint |

References

| Resource | Relevance |
|---|---|
| bazel/ocaml/examples/toycaml/README.md | the component-to-feature map, runnable |
| docs/decisions/tooling/004-ocaml-rules-for-semgrep.md | the scaling decision this serves |
| docs/plans/2026-06-10-ocaml-rules-semgrep-scale.md | the phase plan tOyCaml is the acceptance target for |
| github.com/semgrep/semgrep | the public engine whose build shape is mirrored |
| docs/decisions/tooling/006-extensible-multiarch-ocaml-toolchains.md | how the toy builds per-arch |
| docs/decisions/tooling/007-ocaml-build-file-generation-gazelle.md | how the toy's BUILDs are generated at scale |