Claude CLI Variance Benchmark Report

Date: 2025-12-30 | Model: Claude (via claude CLI) | 5 runs/question | 10 questions

Executive Summary

Metric                       Raw Prompts   Structured   Change
Mean Agreement Rate (TARa)   96.0%         98.0%        +2.0 pp
Inconsistency Rate           4.0%          2.0%         -2.0 pp
Mean Variance Reduction      -             -            5.0%

Results by Category

Factual (2 questions)

Raw Agreement   Structured Agreement   Improvement
100.0%          100.0%                 +0.0 pp

Math (3 questions)

Raw Agreement   Structured Agreement   Improvement
100.0%          100.0%                 +0.0 pp

Logic (2 questions)

Raw Agreement   Structured Agreement   Improvement
100.0%          100.0%                 +0.0 pp

Decision (2 questions)

Raw Agreement   Structured Agreement   Improvement
90.0%           90.0%                  +0.0 pp

Complex (1 question)

Raw Agreement   Structured Agreement   Improvement
80.0%           100.0%                 +20.0 pp
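
The per-category figures can be cross-checked against the executive summary. The sketch below assumes the overall agreement rate is a mean weighted by the number of questions in each category; the report does not state how it aggregates, and the names `CATEGORY_RESULTS` and `overall_agreement` are illustrative only.

```python
# Rows copied from the per-category results in this report:
# (category, number of questions, raw agreement %, structured agreement %)
CATEGORY_RESULTS = [
    ("Factual", 2, 100.0, 100.0),
    ("Math", 3, 100.0, 100.0),
    ("Logic", 2, 100.0, 100.0),
    ("Decision", 2, 90.0, 90.0),
    ("Complex", 1, 80.0, 100.0),
]


def overall_agreement(rows, column):
    """Question-weighted mean of one agreement column (2 = raw, 3 = structured).

    Assumption: categories are weighted by question count.
    """
    total_questions = sum(row[1] for row in rows)
    weighted_sum = sum(row[1] * row[column] for row in rows)
    return weighted_sum / total_questions


raw = overall_agreement(CATEGORY_RESULTS, 2)
structured = overall_agreement(CATEGORY_RESULTS, 3)
print(f"raw={raw:.1f}%  structured={structured:.1f}%  change={structured - raw:+.1f} pp")
```

Under that weighting the category rows give 96.0% and 98.0%, matching the summary values, so the whole +2.0 pp gain comes from the single Complex question.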

Methodology

Based on academic literature:

Protocol

  1. Each question is run 5 times with identical prompts
  2. Two conditions are compared: raw prompts vs 5-step structured reasoning
  3. Metric: TARa (Total Agreement Rate for parsed answers)
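
The protocol above could be sketched roughly as follows. The report does not give the exact formula, so this assumes that a question's agreement is the share of runs whose parsed answer matches the most common answer; that reading is consistent with a single five-run question scoring 80.0%. The `ask` callable is a hypothetical stand-in for whatever invokes the model, and the lowercase/strip normalization is likewise an assumption.

```python
from collections import Counter
from typing import Callable, Sequence


def run_trials(ask: Callable[[str], str], prompt: str, runs: int = 5) -> list[str]:
    """Send the identical prompt `runs` times and collect normalized answers.

    `ask` is a hypothetical callable that returns the parsed answer text.
    """
    return [ask(prompt).strip().lower() for _ in range(runs)]


def agreement_rate(answers: Sequence[str]) -> float:
    """Share of runs whose answer matches the most common answer (assumed metric)."""
    if not answers:
        raise ValueError("need at least one answer")
    _, modal_count = Counter(answers).most_common(1)[0]
    return modal_count / len(answers)


def mean_agreement(per_question: Sequence[Sequence[str]]) -> float:
    """Unweighted mean of per-question agreement rates."""
    if not per_question:
        raise ValueError("need at least one question")
    return sum(agreement_rate(a) for a in per_question) / len(per_question)
```

For example, five runs that return four matching answers and one outlier give an agreement rate of 0.8 under this definition, and the inconsistency rate is one minus the mean agreement.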

Structured prompting reduced the inconsistency rate from 4.0% to 2.0%, a drop
of 2.0 percentage points that halves the raw rate. The separately reported
mean variance reduction was 5.0%.
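
A change between two percentages can be expressed in absolute terms (percentage points) or relative to the starting value, and the two differ sharply here. The sketch below only contrasts those two views for the inconsistency rate; it does not reproduce the 5.0% variance figure, whose formula is not given in the report. The function names are illustrative.

```python
def absolute_change_pp(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_change_pct(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting value."""
    if before_pct == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after_pct - before_pct) / before_pct * 100.0


# Inconsistency rate from the executive summary: 4.0% raw, 2.0% structured.
print(absolute_change_pp(4.0, 2.0))   # -2.0 (percentage points)
print(relative_change_pct(4.0, 2.0))  # -50.0 (percent of the raw rate)
```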