Claude CLI Variance Benchmark Report
2025-12-30
Model: Claude (via claude CLI)
5 runs/question
10 questions
Executive Summary
| Metric | Raw Prompts | Structured | Change |
|---|---|---|---|
| Mean Agreement Rate (TARa) | 96.0% | 98.0% | +2.0 pp |
| Inconsistency Rate | 4.0% | 2.0% | -2.0 pp |
| Mean Variance Reduction | - | - | 5.0% |
Results by Category
Factual (2 questions)
| Raw Agreement | Structured Agreement | Improvement |
|---|---|---|
| 100.0% | 100.0% | +0.0 pp |
Math (3 questions)
| Raw Agreement | Structured Agreement | Improvement |
|---|---|---|
| 100.0% | 100.0% | +0.0 pp |
Logic (2 questions)
| Raw Agreement | Structured Agreement | Improvement |
|---|---|---|
| 100.0% | 100.0% | +0.0 pp |
Decision (2 questions)
| Raw Agreement | Structured Agreement | Improvement |
|---|---|---|
| 90.0% | 90.0% | +0.0 pp |
Complex (1 question)
| Raw Agreement | Structured Agreement | Improvement |
|---|---|---|
| 80.0% | 100.0% | +20.0 pp |
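As a consistency check, the per-category figures can be rolled up into the overall agreement rates by weighting each category by its question count. The sketch below is illustrative only: the `categories` structure and the `weighted_mean` helper are hypothetical names, and it assumes the overall rate is a question-weighted mean of the category rates.

```python
# (question count, raw agreement %, structured agreement %) per category,
# copied from the category tables in this report.
categories = {
    "Factual": (2, 100.0, 100.0),
    "Math": (3, 100.0, 100.0),
    "Logic": (2, 100.0, 100.0),
    "Decision": (2, 90.0, 90.0),
    "Complex": (1, 80.0, 100.0),
}


def weighted_mean(rows, column):
    """Question-weighted mean of one percentage column (hypothetical helper)."""
    total_questions = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total_questions


rows = list(categories.values())
raw = weighted_mean(rows, 1)
structured = weighted_mean(rows, 2)
print(f"Raw: {raw:.1f}%  Structured: {structured:.1f}%  Change: {structured - raw:+.1f} pp")
```

Under that assumption the roll-up gives 96.0% for raw prompts and 98.0% for structured prompts, which matches the Executive Summary.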
Methodology
Based on academic literature:
Protocol
- Each question run 5 times with identical prompts
- Two conditions: Raw prompts vs 5-step structured reasoning
- Metric: TARa (Total Agreement Rate for parsed answers)
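The report does not spell out how per-question agreement is computed. A minimal sketch, assuming agreement is the share of runs whose parsed answer matches the most common answer for that question (an assumption that would be consistent with figures such as 80.0% for five runs), might look like this. The function names and the sample answers are hypothetical and are not taken from the benchmark data.

```python
from collections import Counter


def agreement_rate(parsed_answers):
    """Share of runs that match the most common parsed answer (assumed definition)."""
    if not parsed_answers:
        raise ValueError("need at least one run")
    counts = Counter(parsed_answers)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(parsed_answers)


def mean_agreement(runs_by_question):
    """Average the per-question agreement rates across all questions."""
    rates = [agreement_rate(answers) for answers in runs_by_question.values()]
    return sum(rates) / len(rates)


# Hypothetical parsed answers for two questions, 5 runs each.
sample_runs = {
    "q1": ["42", "42", "42", "42", "42"],
    "q2": ["B", "B", "A", "B", "B"],
}

for question, answers in sample_runs.items():
    print(f"{question}: {agreement_rate(answers):.1%} agreement")
print(f"mean: {mean_agreement(sample_runs):.1%}")
```

With this definition the inconsistency rate is simply the complement of the agreement rate, which is how the 96.0% and 4.0% figures in the summary relate to each other.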
Structured prompting reduced output inconsistency from 4.0% to 2.0% (a drop of 2.0 pp), with a mean variance reduction of 5.0%.