Research · 2026-04-21 · final

Tendril's pipeline vs. Claude Opus 4.7 alone

120 trials. Three coding tasks. Same model. We measured what happens when you orchestrate Opus instead of just calling it once.

  • Design — 3-arm randomized trial, interleaved order
  • Sample — n=120 (treatment 60, single-opus 30, opus-refined 30)
  • Tasks — posts CRUD, password reset, pagination
  • Fixture — Express + SQLite REST API, 9 files
  • Models — Plan: Opus 4.7 xhigh · Exec: Sonnet 4.6 medium · Audit: Sonnet 4.6 high
  • Cost — $92.30 · 157 min wall time

Hypotheses (pre-registered)

  • H1 — Tendril's pipeline (treatment) scores higher quality than cold single-agent Opus on complex coding tasks.
  • H2 — Quality gap widens with task complexity (pagination > password reset > posts CRUD).
  • H3 — Simulated user refinement (opus-refined) closes some of the gap but not all of it.
  • H4 — Tendril's higher per-task cost is justified by quality gains on tasks where coordination matters.

Results (final, n=120)

Overall, pooled across tasks

Arm                                  n    Cost (USD)     Tokens   Quality (0-100)
Treatment (Tendril)                  60   $0.73 ± 0.25   0.83M    69.2 ± 23.4
Single Opus (cold)                   30   $0.56 ± 0.39   0.51M    47.1 ± 38.3
Opus-refined (simulated user loop)   30   $1.06 ± 0.62   0.98M    42.9 ± 37.9
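As a sanity check, the per-arm mean costs reconcile with the reported $92.30 total. A minimal sketch (the figures are the table's means, so the sum only matches to within rounding):

```javascript
// Reconcile per-arm mean costs against the reported ~$92.30 total spend.
const arms = [
  { name: "treatment",    n: 60, meanCost: 0.73 },
  { name: "single-opus",  n: 30, meanCost: 0.56 },
  { name: "opus-refined", n: 30, meanCost: 1.06 },
];

const totalCost = arms.reduce((sum, a) => sum + a.n * a.meanCost, 0);
console.log(totalCost.toFixed(2)); // ≈ 92.40, within rounding of the reported $92.30
```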

Per-task quality breakdown

The overall average hides the actual story. Quality gaps depend almost entirely on task difficulty:

Task                                                    Treatment     Single Opus   Opus-refined
task-1 posts CRUD (easy — new file, single concern)     67.7 (n=20)   69.2 (n=10)   48.8 (n=10)
task-2 password reset (medium — multi-file, security)   59.4 (n=20)   61.9 (n=10)   70.0 (n=10)
task-3 pagination (hard — edge cases, validation)       80.5 (n=20)   10.2 (n=10)   10.0 (n=10)
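Because each arm runs the same number of trials on every task, the pooled quality means are just the unweighted averages of the three per-task means. A quick check that the two tables agree:

```javascript
// Per-task quality means from the breakdown table, one array per arm.
const perTask = {
  treatment:   [67.7, 59.4, 80.5],
  singleOpus:  [69.2, 61.9, 10.2],
  opusRefined: [48.8, 70.0, 10.0],
};

const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

for (const [arm, scores] of Object.entries(perTask)) {
  console.log(arm, mean(scores).toFixed(1));
}
// → treatment 69.2 · singleOpus 47.1 · opusRefined 42.9, matching the pooled table
```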

Headline finding

On the hardest task, Tendril scored 80.5 / 100. Cold Opus 4.7 scored 10.2 / 100. That's a nearly 8× quality gap on what's genuinely the most realistic kind of feature work — an endpoint with input validation, edge cases, and proper error handling. On easier tasks, cold Opus is fine. On harder ones, you need orchestration.

Methodology

Arms

  • Treatment — Tendril's full pipeline: Knowledge Graph priming, parallel sub-agent execution, audit review, final scoring. Opus plans at xhigh effort, Sonnet executes at medium, Sonnet audits at high.
  • Single-opus — One Claude Code agentic session at xhigh effort with up to 3 retries if the target file is empty. Gets the full implementation prompt cold. This is the strongest version of "just use Claude".
  • Opus-refined — Simulated user-in-the-loop. Starts with an underspecified user-voice request ("I need a posts API for my app..."). After each Opus attempt, a Sonnet judge compares the output to hidden acceptance criteria and generates a realistic user refinement ("hey, you forgot auth on the PATCH route"). Repeats up to 3 rounds. Measures total cost including simulated user turns.
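The opus-refined arm's control flow can be sketched as below. The helper functions (`runOpus`, `judgeAndRefine`) are hypothetical stand-ins for the harness's internals, not its actual API:

```javascript
// Sketch of the simulated user-in-the-loop arm: Opus attempts, a Sonnet
// judge compares the output to hidden acceptance criteria, and a
// user-voice refinement feeds the next round. Up to 3 rounds total.
async function opusRefinedTrial(initialRequest, hiddenCriteria, maxRounds = 3) {
  let prompt = initialRequest; // underspecified, user-voice request
  let output = null;
  let totalCost = 0;

  for (let round = 0; round < maxRounds; round++) {
    const attempt = await runOpus(prompt); // Opus implements the current request
    output = attempt.output;
    totalCost += attempt.cost;

    // Judge against hidden criteria; either accept or produce a realistic
    // user refinement ("hey, you forgot auth on the PATCH route").
    const feedback = await judgeAndRefine(output, hiddenCriteria);
    totalCost += feedback.cost; // simulated user turns count toward cost
    if (feedback.accepted) break;
    prompt = feedback.refinement;
  }
  return { output, totalCost };
}
```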

Scoring

An LLM judge (Sonnet at low effort) scores each run 0-100 against a task-specific rubric. Scoring is strict — the judge deducts for missing edge cases, wrong error handling, broken patterns, and incomplete implementation. Scoring prompts are committed to the repo, and the same rubric runs against every trial in every arm, so no arm can be favored by rubric differences.
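The strict-deduction shape the rubric prompt encodes can be sketched as a start-at-100, subtract-per-miss scorer. The criterion names below are illustrative, not taken from the repo's rubrics:

```javascript
// Hypothetical sketch of strict rubric scoring: start at 100, subtract
// points per missed criterion, floor at 0.
function rubricScore(deductions) {
  const lost = deductions.reduce((sum, d) => sum + d.points, 0);
  return Math.max(0, 100 - lost);
}

// e.g. a run missing an edge case and using the wrong error status:
const score = rubricScore([
  { criterion: "missing empty-page edge case", points: 15 },
  { criterion: "500 instead of 400 on bad cursor", points: 20 },
]);
console.log(score); // → 65
```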

Randomization

Trial order is shuffled — arm type and task are interleaved so time-of-day effects, API rate-limit states, and model drift hit all arms equally. Each trial runs in a fresh fixture copy; no cross-trial state pollution.
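Interleaving of this kind is typically done by enumerating every (arm, task) trial up front and shuffling the whole list. A minimal sketch with a Fisher-Yates shuffle (function and parameter names are assumptions, not the harness's API):

```javascript
// Build all (arm, task) trials, then Fisher-Yates shuffle so arm types
// and tasks are mixed in time — spreading rate-limit and drift effects
// evenly across arms.
function buildTrialOrder(countsPerTask, tasks, rng = Math.random) {
  const trials = [];
  for (const [arm, perTask] of Object.entries(countsPerTask)) {
    for (const task of tasks) {
      for (let i = 0; i < perTask; i++) trials.push({ arm, task });
    }
  }
  for (let i = trials.length - 1; i > 0; i--) { // Fisher-Yates shuffle
    const j = Math.floor(rng() * (i + 1));
    [trials[i], trials[j]] = [trials[j], trials[i]];
  }
  return trials;
}
```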

What this does NOT prove

  • Single fixture. Express + SQLite REST API. Results may not generalize to Next.js, Django, Swift, Go, data-science notebooks, or any other stack.
  • Three tasks. Posts CRUD, password reset, pagination. These are representative but not exhaustive.
  • Small per-cell n. 20 treatment / 10 per Opus arm per task. Enough to detect large effects; not enough to resolve small ones.
  • LLM-judge scoring. Not human review. We chose consistency over naturalism — a human judge would produce tighter variance but wouldn't scale to 120 trials.
  • Cost excludes user time. Treatment is slower wall-clock than cold Opus. For interactive use, this matters; our cost-per-task number does not capture it.
  • Opus 4.7 only. Older Opus, Sonnet-only pipelines, and non-Anthropic models were not tested in this study.

Reproduce this study

The full harness, fixtures, and scoring rubrics live in the Tendril repo. To replicate:

git clone https://github.com/your-org/tendril
cd tendril/tests/kg-ab
node ab-rigorous.js --include-refined --refined-trials 10

Expect ~$90 of API cost on Claude Code CLI (Opus 4.7 + Sonnet 4.6) and ~3 hours of wall time. Results are written to tests/kg-ab/results/rigorous-<timestamp>.json and a corrected markdown report.