Research · 2026-04-21 · final

Tendril's pipeline vs. Claude Opus 4.7 alone

120 trials. Three coding tasks. Same model. We measured what happens when you orchestrate Opus instead of just calling it once.

  • Design — 3-arm randomized trial, interleaved order
  • Sample — n=120 (treatment 60, single-opus 30, opus-refined 30)
  • Tasks — posts CRUD, password reset, pagination
  • Fixture — Express + SQLite REST API, 9 files
  • Models — Plan: Opus 4.7 xhigh · Exec: Sonnet 4.6 medium · Audit: Sonnet 4.6 high
  • Cost — $92.30 · 157 min wall time

Hypotheses (pre-registered)

  • H1 — Tendril's pipeline (treatment) scores higher quality than cold single-agent Opus on complex coding tasks.
  • H2 — Quality gap widens with task complexity (pagination > password reset > posts CRUD).
  • H3 — Simulated user refinement (opus-refined) closes some of the gap but not all of it.
  • H4 — Tendril's higher per-task cost is justified by quality gains on tasks where coordination matters.

Results (final, n=120)

Overall, pooled across tasks

Arm                                  n    Cost (USD)     Tokens   Quality (0-100)
Treatment (Tendril)                  60   $0.73 ± 0.25   0.83M    69.2 ± 23.4
Single Opus (cold)                   30   $0.56 ± 0.39   0.51M    47.1 ± 38.3
Opus-refined (simulated user loop)   30   $1.06 ± 0.62   0.98M    42.9 ± 37.9
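As a sanity check, the per-arm mean costs reconcile with the reported $92.30 total. A minimal sketch (the figures are the table's means, so the sum only matches to within rounding):

```javascript
// Reconcile per-arm mean costs against the reported ~$92.30 total spend.
const arms = [
  { name: "treatment",    n: 60, meanCost: 0.73 },
  { name: "single-opus",  n: 30, meanCost: 0.56 },
  { name: "opus-refined", n: 30, meanCost: 1.06 },
];

const totalCost = arms.reduce((sum, a) => sum + a.n * a.meanCost, 0);
console.log(totalCost.toFixed(2)); // ≈ 92.40, within rounding of the reported $92.30
```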

Per-task quality breakdown

The overall average hides the actual story. Quality gaps depend almost entirely on task difficulty:

Task                                                    Treatment     Single Opus   Opus-refined
task-1 posts CRUD (easy — new file, single concern)     67.7 (n=20)   69.2 (n=10)   48.8 (n=10)
task-2 password reset (medium — multi-file, security)   59.4 (n=20)   61.9 (n=10)   70.0 (n=10)
task-3 pagination (hard — edge cases, validation)       80.5 (n=20)   10.2 (n=10)   10.0 (n=10)
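Because each arm runs the same number of trials on every task, the pooled quality means are just the unweighted averages of the three per-task means. A quick check that the two tables agree:

```javascript
// Per-task quality means from the breakdown table, one array per arm.
const perTask = {
  treatment:   [67.7, 59.4, 80.5],
  singleOpus:  [69.2, 61.9, 10.2],
  opusRefined: [48.8, 70.0, 10.0],
};

const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

for (const [arm, scores] of Object.entries(perTask)) {
  console.log(arm, mean(scores).toFixed(1));
}
// → treatment 69.2 · singleOpus 47.1 · opusRefined 42.9, matching the pooled table
```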

Headline finding

On the hardest task, Tendril scored 80.5 / 100. Cold Opus 4.7 scored 10.2 / 100. That's a nearly 8× quality gap on what's genuinely the most realistic kind of feature work — an endpoint with input validation, edge cases, and proper error handling. On easier tasks, cold Opus is fine. On harder ones, you need orchestration.

Methodology

Arms

  • Treatment — Tendril's full pipeline: Knowledge Graph priming, parallel sub-agent execution, audit review, final scoring. Opus plans at xhigh effort, Sonnet executes at medium, Sonnet audits at high.
  • Single-opus — One Claude Code agentic session at xhigh effort with up to 3 retries if the target file is empty. Gets the full implementation prompt cold. This is the strongest version of "just use Claude".
  • Opus-refined — Simulated user-in-the-loop. Starts with an underspecified user-voice request ("I need a posts API for my app..."). After each Opus attempt, a Sonnet judge compares the output to hidden acceptance criteria and generates a realistic user refinement ("hey, you forgot auth on the PATCH route"). Repeats up to 3 rounds. Measures total cost including simulated user turns.
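The opus-refined arm's control flow can be sketched as below. The helper functions (`runOpus`, `judgeAndRefine`) are hypothetical stand-ins for the harness's internals, not its actual API:

```javascript
// Sketch of the simulated user-in-the-loop arm: Opus attempts, a Sonnet
// judge compares the output to hidden acceptance criteria, and a
// user-voice refinement feeds the next round. Up to 3 rounds total.
async function opusRefinedTrial(initialRequest, hiddenCriteria, maxRounds = 3) {
  let prompt = initialRequest; // underspecified, user-voice request
  let output = null;
  let totalCost = 0;

  for (let round = 0; round < maxRounds; round++) {
    const attempt = await runOpus(prompt); // Opus implements the current request
    output = attempt.output;
    totalCost += attempt.cost;

    // Judge against hidden criteria; either accept or produce a realistic
    // user refinement ("hey, you forgot auth on the PATCH route").
    const feedback = await judgeAndRefine(output, hiddenCriteria);
    totalCost += feedback.cost; // simulated user turns count toward cost
    if (feedback.accepted) break;
    prompt = feedback.refinement;
  }
  return { output, totalCost };
}
```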

Scoring

An LLM judge (Sonnet at low effort) scores each run 0-100 against a task-specific rubric. Scoring is strict — the judge deducts for missing edge cases, wrong error handling, broken patterns, and incomplete implementation. Scoring prompts are committed to the repo, and the same rubric runs against every trial in every arm, so no arm can be favored by rubric differences.
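The strict-deduction shape the rubric prompt encodes can be sketched as a start-at-100, subtract-per-miss scorer. The criterion names below are illustrative, not taken from the repo's rubrics:

```javascript
// Hypothetical sketch of strict rubric scoring: start at 100, subtract
// points per missed criterion, floor at 0.
function rubricScore(deductions) {
  const lost = deductions.reduce((sum, d) => sum + d.points, 0);
  return Math.max(0, 100 - lost);
}

// e.g. a run missing an edge case and using the wrong error status:
const score = rubricScore([
  { criterion: "missing empty-page edge case", points: 15 },
  { criterion: "500 instead of 400 on bad cursor", points: 20 },
]);
console.log(score); // → 65
```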

Randomization

Trial order is shuffled — arm type and task are interleaved so time-of-day effects, API rate-limit states, and model drift hit all arms equally. Each trial runs in a fresh fixture copy; no cross-trial state pollution.
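Interleaving of this kind is typically done by enumerating every (arm, task) trial up front and shuffling the whole list. A minimal sketch with a Fisher-Yates shuffle (function and parameter names are assumptions, not the harness's API):

```javascript
// Build all (arm, task) trials, then Fisher-Yates shuffle so arm types
// and tasks are mixed in time — spreading rate-limit and drift effects
// evenly across arms.
function buildTrialOrder(countsPerTask, tasks, rng = Math.random) {
  const trials = [];
  for (const [arm, perTask] of Object.entries(countsPerTask)) {
    for (const task of tasks) {
      for (let i = 0; i < perTask; i++) trials.push({ arm, task });
    }
  }
  for (let i = trials.length - 1; i > 0; i--) { // Fisher-Yates shuffle
    const j = Math.floor(rng() * (i + 1));
    [trials[i], trials[j]] = [trials[j], trials[i]];
  }
  return trials;
}
```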

What this does NOT prove

  • Single fixture. Express + SQLite REST API. Results may not generalize to Next.js, Django, Swift, Go, data-science notebooks, or any other stack.
  • Three tasks. Posts CRUD, password reset, pagination. These are representative but not exhaustive.
  • Small per-cell n. 20 treatment / 10 per Opus arm per task. Enough to detect large effects; not enough to resolve small ones.
  • LLM-judge scoring. Not human review. We chose consistency over naturalism — a human judge would produce tighter variance but wouldn't scale to 120 trials.
  • Cost excludes user time. Treatment is slower wall-clock than cold Opus. For interactive use, this matters; our cost-per-task number does not capture it.
  • Opus 4.7 only. Older Opus, Sonnet-only pipelines, and non-Anthropic models were not tested in this study.

Reproduce this study

The full harness, fixtures, and scoring rubrics live in the Tendril repo. To replicate:

git clone https://github.com/your-org/tendril
cd tendril/tests/kg-ab
node ab-rigorous.js --include-refined --refined-trials 10

Expect ~$90 of API cost on Claude Code CLI (Opus 4.7 + Sonnet 4.6) and ~3 hours of wall time. Results are written to tests/kg-ab/results/rigorous-<timestamp>.json and a corrected markdown report.