Tendril's pipeline vs. Claude Opus 4.7 alone
120 trials. Three coding tasks. Same model. We measured what happens when you orchestrate Opus instead of just calling it once.
Hypotheses (pre-registered)
- H1 — Tendril's pipeline (treatment) scores higher quality than cold single-agent Opus on complex coding tasks.
- H2 — Quality gap widens with task complexity (pagination > password reset > posts CRUD).
- H3 — Simulated user refinement (opus-refined) closes some of the gap but not all of it.
- H4 — Tendril's higher per-task cost is justified by quality gains on tasks where coordination matters.
Results (final, n=120)
Overall, pooled across tasks
| Arm | n | Cost (USD) | Tokens | Quality (0-100) |
|---|---|---|---|---|
| Treatment (Tendril) | 60 | $0.73 ± 0.25 | 0.83M | 69.2 ± 23.4 |
| Single Opus (cold) | 30 | $0.56 ± 0.39 | 0.51M | 47.1 ± 38.3 |
| Opus-refined (simulated user loop) | 30 | $1.06 ± 0.62 | 0.98M | 42.9 ± 37.9 |
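One back-of-envelope reading of the pooled table is cost per quality point (mean cost divided by mean quality, ignoring variance and wall-clock time). A quick sketch using only the numbers above:

```javascript
// Cost per quality point, computed from the pooled means in the table
// above (mean cost in USD divided by mean quality score).
const arms = {
  treatment:   { cost: 0.73, quality: 69.2 },
  singleOpus:  { cost: 0.56, quality: 47.1 },
  opusRefined: { cost: 1.06, quality: 42.9 },
};

const costPerPoint = Object.fromEntries(
  Object.entries(arms).map(([name, a]) => [name, a.cost / a.quality])
);

// treatment ≈ $0.0105/pt, singleOpus ≈ $0.0119/pt, opusRefined ≈ $0.0247/pt
console.log(costPerPoint);
```

On the pooled means, the treatment arm is the cheapest per quality point despite its higher absolute cost; the per-task breakdown below shows this is driven almost entirely by the hard task.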
Per-task quality breakdown
The overall average hides the actual story. Quality gaps depend almost entirely on task difficulty:
| Task | Treatment | Single Opus | Opus-refined |
|---|---|---|---|
| task-1 posts CRUD (easy — new file, single concern) | 67.7 (n=20) | 69.2 (n=10) | 48.8 (n=10) |
| task-2 password reset (medium — multi-file, security) | 59.4 (n=20) | 61.9 (n=10) | 70.0 (n=10) |
| task-3 pagination (hard — edge cases, validation) | 80.5 (n=20) | 10.2 (n=10) | 10.0 (n=10) |
Headline finding
On the hardest task, Tendril scored 80.5 / 100. Cold Opus 4.7 scored 10.2 / 100. That's nearly an 8× quality gap on the most realistic kind of feature work — an endpoint with input validation, edge cases, and proper error handling. On easier tasks, cold Opus is fine. On harder ones, you need orchestration.
Methodology
Arms
- Treatment — Tendril's full pipeline: Knowledge Graph priming, parallel sub-agent execution, audit review, final scoring. Opus plans at xhigh effort, Sonnet executes at medium, Sonnet audits at high.
- Single-opus — One Claude Code agentic session at xhigh effort with up to 3 retries if the target file is empty. Gets the full implementation prompt cold. This is the strongest version of "just use Claude".
- Opus-refined — Simulated user-in-the-loop. Starts with an underspecified user-voice request ("I need a posts API for my app..."). After each Opus attempt, a Sonnet judge compares the output to hidden acceptance criteria and generates a realistic user refinement ("hey, you forgot auth on the PATCH route"). Repeats up to 3 rounds. Measures total cost including simulated user turns.
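The opus-refined arm's control flow can be sketched as a bounded judge-and-refine loop. This is a schematic, not the harness's actual code: `callOpus`, `judge`, and the verdict shape are hypothetical stand-ins, and the model calls are expected to be injected.

```javascript
// Sketch of the simulated user-in-the-loop arm (hypothetical names,
// not the harness's real implementation). Model calls are passed in
// so the control flow is visible and testable.
const MAX_ROUNDS = 3;

async function opusRefinedTrial(initialRequest, hiddenCriteria, callOpus, judge) {
  let prompt = initialRequest;
  const transcript = [];
  let output = null;

  for (let round = 1; round <= MAX_ROUNDS; round++) {
    output = await callOpus(prompt);                      // Opus attempt
    const verdict = await judge(output, hiddenCriteria);  // Sonnet judge vs hidden criteria
    transcript.push({ round, verdict });
    if (verdict.pass) break;                              // criteria met: stop early
    // Otherwise the judge writes a user-voice refinement, e.g.
    // "hey, you forgot auth on the PATCH route", which becomes the next prompt.
    prompt = verdict.userRefinement;
  }
  return { output, rounds: transcript.length, transcript };
}
```

Total cost for the arm would then be summed across every Opus attempt plus every judge turn, matching the "including simulated user turns" accounting above.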
Scoring
An LLM judge (Sonnet at low effort) scores each run 0-100 against a task-specific rubric. Scoring is strict — the judge deducts for missing edge cases, wrong error handling, broken patterns, and incomplete implementation. The scoring prompts are committed to the repo and applied identically to every trial, so no arm can benefit from a rubric tailored to it.
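Strict rubric scoring of this kind amounts to starting from 100 and deducting a weight for each failed criterion. A minimal sketch — the criteria and weights here are invented for illustration, not the repo's actual rubric:

```javascript
// Minimal sketch of strict rubric scoring: start at 100 and deduct a
// weighted penalty for every criterion the judge marks as failed.
// Criteria and weights are illustrative, not the repo's real rubric.
function scoreAgainstRubric(verdicts) {
  // verdicts: [{ criterion: string, weight: number, passed: boolean }]
  let score = 100;
  for (const v of verdicts) {
    if (!v.passed) score -= v.weight; // strict: full deduction per miss
  }
  return Math.max(0, score);          // clamp at zero
}
```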
Randomization
Trial order is shuffled — arm type and task are interleaved so time-of-day effects, API rate-limit states, and model drift hit all arms equally. Each trial runs in a fresh fixture copy; no cross-trial state pollution.
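Interleaving of this kind is a uniform shuffle of the full trial list. A sketch using Fisher-Yates, with a hypothetical trial-plan shape (not the harness's actual data structures):

```javascript
// Sketch of trial-order randomization: expand every (arm, task) cell
// into individual trials, then Fisher-Yates shuffle so arms and tasks
// interleave over time. The plan shape is hypothetical.
function buildShuffledTrials(plan, rng = Math.random) {
  // plan: [{ arm: "treatment", task: "task-1", n: 20 }, ...]
  const trials = plan.flatMap((p) =>
    Array.from({ length: p.n }, (_, i) => ({ arm: p.arm, task: p.task, rep: i + 1 }))
  );
  for (let i = trials.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [trials[i], trials[j]] = [trials[j], trials[i]]; // swap in place
  }
  return trials;
}
```

A uniform shuffle spreads each arm across the whole run, so time-correlated confounders (rate limits, model drift) average out across arms rather than biasing one.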
What this does NOT prove
- Single fixture. Express + SQLite REST API. Results may not generalize to Next.js, Django, Swift, Go, data-science notebooks, or any other stack.
- Three tasks. Posts CRUD, password reset, pagination. These are representative but not exhaustive.
- Small per-cell n. 20 treatment / 10 per Opus arm per task. Enough to detect large effects; not enough to resolve small ones.
- LLM-judge scoring. Not human review. We chose consistency over naturalism — a human judge would produce tighter variance but wouldn't scale to 120 trials.
- Cost excludes user time. Treatment is slower wall-clock than cold Opus. For interactive use, this matters; our cost-per-task number does not capture it.
- Opus 4.7 only. Older Opus, Sonnet-only pipelines, and non-Anthropic models were not tested in this study.
Reproduce this study
The full harness, fixtures, and scoring rubrics live in the Tendril repo. To replicate:
```shell
git clone https://github.com/your-org/tendril
cd tendril/tests/kg-ab
node ab-rigorous.js --include-refined --refined-trials 10
```
Expect ~$90 of API cost on Claude Code CLI (Opus 4.7 + Sonnet 4.6) and ~3 hours of wall time. Results are written to `tests/kg-ab/results/rigorous-<timestamp>.json` and a corrected markdown report.
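The schema of the results JSON isn't documented here. Assuming each trial record carries at least `arm` and `quality` fields (an assumption — check the actual file before relying on this), a per-arm summary is a short reduce:

```javascript
// Group trial records by arm and average their quality scores.
// Field names (arm, quality) are assumed; verify them against the
// actual rigorous-<timestamp>.json output before use.
function summarizeByArm(trials) {
  const buckets = {};
  for (const t of trials) {
    (buckets[t.arm] ??= []).push(t.quality);
  }
  return Object.fromEntries(
    Object.entries(buckets).map(([arm, qs]) => [
      arm,
      { n: qs.length, meanQuality: qs.reduce((a, b) => a + b, 0) / qs.length },
    ])
  );
}
```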