Research¶

Paper¶

Snodo: Protocol-Governed Software Development with AI Agents

Under review at ACM Transactions on Software Engineering and Methodology (TOSEM). arXiv link pending acceptance.

Claims → code map¶

The paper's architectural claims are grounded in the codebase. Each invariant and enforcement mechanism maps to a specific module and test suite.

Paper claim	Code module	Tests
WF1 — Mode separation (disjoint tool sets)	`snodo/compiler/verifier.py:check_wf1()`	`tests/compiler/test_models.py`
WF2 — Role uniqueness	`snodo/compiler/verifier.py:check_wf2()`	`tests/compiler/test_models.py`
WF3 — Validator coverage	`snodo/compiler/verifier.py:check_wf3()`	`tests/engine/test_loop.py`
WF4 — Policy completeness	`snodo/compiler/verifier.py:check_wf4()`	`tests/compiler/test_models.py`
WF5 — Constraint consistency	`snodo/compiler/verifier.py:check_wf5()`	`tests/compiler/test_models.py`
INV1 — JWT token integrity	`snodo/infrastructure/tokens.py`	`tests/properties/test_invariants.py`
INV2 — Mode boundary (MCP servers)	`snodo/mcp/server.py`	`tests/mcp/test_server.py`
INV3 — Blocker unconditional halt	`snodo/engine/policy.py:113-123`	`tests/engine/test_policy.py`
INV4 — Hash-chained audit	`snodo/infrastructure/audit.py`	`tests/infrastructure/test_session.py`
INV5 — Session resumability	`snodo/infrastructure/session.py`	`tests/infrastructure/test_session.py`
2+N reference protocol	`snodo/protocols/templates/2+n.yml`	`tests/e2e/test_2plus_n_journey.py`
Kleene closure (recursive subtasks)	`snodo/engine/loop.py`	`tests/engine/test_loop.py`

Empirical studies¶

The Wave 8 studies (studies/) produce the paper's Section 5 figures:

Study	Section	What it shows
policy_monte_carlo	5.1	False-block vs false-pass trade-off curves across the four disagreement policies, using the real PolicyEvaluator
detection_probability	5.2	Monte Carlo validation of the paper's quorum-miss formula and structural-vs-behavioral failure rate bounds
byzantine_robustness	5.3	Adversarial validator sensitivity — quantifying the design cost of INV3's unconditional blocker semantics
overhead_benchmarks	5.4	Real governance overhead: per-operation latency + end-to-end governed-vs-ungoverned comparison

All studies are deterministic (seeded RNG) and produce paper-styled SVG figures + CSV data. Run with make studies from the repository root.

Hypothesis property tests¶

130,000+ examples tested under Hypothesis with zero invariant violations committed to the audit log (studies/hypothesis_paper_run.log). Key properties: JWT tampering detection, WF1 tool disjointness, policy HALT on any blocker, and protocol templates pass all WF1-WF5 checks.