Skip to content

Research

Paper

Snodo: Protocol-Governed Software Development with AI Agents

Under review at ACM Transactions on Software Engineering and Methodology (TOSEM). arXiv link pending acceptance.

Claims → code map

The paper's architectural claims are grounded in the codebase. Each invariant and enforcement mechanism maps to a specific module and test suite.

Paper claim Code module Tests
WF1 — Mode separation (disjoint tool sets) snodo/compiler/verifier.py:check_wf1() tests/compiler/test_models.py
WF2 — Role uniqueness snodo/compiler/verifier.py:check_wf2() tests/compiler/test_models.py
WF3 — Validator coverage snodo/compiler/verifier.py:check_wf3() tests/engine/test_loop.py
WF4 — Policy completeness snodo/compiler/verifier.py:check_wf4() tests/compiler/test_models.py
WF5 — Constraint consistency snodo/compiler/verifier.py:check_wf5() tests/compiler/test_models.py
INV1 — JWT token integrity snodo/infrastructure/tokens.py tests/properties/test_invariants.py
INV2 — Mode boundary (MCP servers) snodo/mcp/server.py tests/mcp/test_server.py
INV3 — Blocker unconditional halt snodo/engine/policy.py:113-123 tests/engine/test_policy.py
INV4 — Hash-chained audit snodo/infrastructure/audit.py tests/infrastructure/test_session.py
INV5 — Session resumability snodo/infrastructure/session.py tests/infrastructure/test_session.py
2+N reference protocol snodo/protocols/templates/2+n.yml tests/e2e/test_2plus_n_journey.py
Kleene closure (recursive subtasks) snodo/engine/loop.py tests/engine/test_loop.py

Empirical studies

The Wave 8 studies (studies/) produce the paper's Section 5 figures:

Study Section What it shows
policy_monte_carlo 5.1 False-block vs false-pass trade-off curves across the four disagreement policies, using the real PolicyEvaluator
detection_probability 5.2 Monte Carlo validation of the paper's quorum-miss formula and structural-vs-behavioral failure rate bounds
byzantine_robustness 5.3 Adversarial validator sensitivity — quantifying the design cost of INV3's unconditional blocker semantics
overhead_benchmarks 5.4 Real governance overhead: per-operation latency + end-to-end governed-vs-ungoverned comparison

All studies are deterministic (seeded RNG) and produce paper-styled SVG figures + CSV data. Run with make studies from the repository root.

Hypothesis property tests

130,000+ examples tested under Hypothesis with zero invariant violations committed to the audit log (studies/hypothesis_paper_run.log). Key properties: JWT tampering detection, WF1 tool disjointness, policy HALT on any blocker, and protocol templates pass all WF1-WF5 checks.