How We Predicted and Then Proved That Self-Reported Confidence Is Decoupled From Accuracy
We first developed a mathematical framework (the Hallucination Accountability Framework) that predicted LLM self-reported confidence would be approximately constant (~0.9) regardless of actual accuracy. We then tested this prediction empirically with 150 API calls across 3 frontier LLMs, 5 files with known ground truth, and 10 runs per configuration. The prediction held: all three models anchored near 85% regardless of whether their verdict was correct. Five weighting methods failed to extract reliable signal. Prompt calibration attempts shifted numbers without improving accuracy. However, we discovered one exception: the direction of confidence (not the number) correctly classifies files when extracted through meta-cognitive framing.
Before running the 150-run study, we developed a mathematical model for why LLMs hallucinate bugs in clean code:
K = information given to the model (0 to 1)
C = self-reported confidence of the output (0 to 1)
P = pattern-match strength (0 to 1)
I = implementation context (0 to 1): the degree to which the model correctly understands WHY the code is written this way
H(F) = P(F) * (1 - I(F))
H(F) = 0 → real bug (pattern real, context understood)
H(F) = 0.9 → hallucination (pattern real, context missed)
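The two cases can be sketched numerically. This is a toy illustration of the framework's terms, not part of the study code:

```python
def hallucination_score(p: float, i: float) -> float:
    """H(F) = P(F) * (1 - I(F)): pattern strength discounted by context understanding."""
    return p * (1 - i)

# Real bug: strong pattern AND the model understands why the code is written this way.
real_bug = hallucination_score(p=0.9, i=1.0)       # -> 0.0

# Hallucination: the same strong pattern, but the context is missed.
hallucination = hallucination_score(p=0.9, i=0.0)  # -> 0.9
```

The same surface pattern yields opposite scores; only the context term I separates them.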
The framework made a specific, testable prediction:
Should be: C = K * f(evidence)
Predicted: C = constant ≈ 0.9, regardless of K
"This looks wrong" (P ≈ 0.9, I = 0) → C ≈ 0.9
"This IS wrong" (P ≈ 0.9, I = 1) → C ≈ 0.9
Both produce the same confidence.
Only ground truth separates them.
| Parameter | Value |
|---|---|
| Models | 3 frontier LLMs from different providers |
| Files | 5 (1 dirty, 4 clean from production frameworks) |
| Runs per config | 10 |
| Total API calls | 150 |
| Cost | ~$1.20 |
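The design amounts to a nested loop over models, files, and runs. A minimal sketch of the harness follows; the `ask_model` call, file names, and return fields are hypothetical stand-ins, since the actual providers and prompts are not specified in this report:

```python
import itertools

MODELS = ["model_a", "model_b", "model_c"]  # 3 frontier LLMs, different providers
FILES = {                                   # 1 dirty, 4 clean (names illustrative)
    "auth.py": "dirty",
    "flask_view.py": "clean",
    "django_view.py": "clean",
    "express_route.js": "clean",
    "vapor_route.swift": "clean",
}
RUNS = 10

def ask_model(model: str, path: str, run: int) -> dict:
    """Hypothetical API call returning {'verdict': 'clean'|'dirty', 'confidence': 0-100}."""
    raise NotImplementedError

calls = list(itertools.product(MODELS, FILES, range(RUNS)))
assert len(calls) == 150  # 3 models x 5 files x 10 runs
```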
How often each model correctly identified the file:
| File (Truth) | Model A | Model B | Model C |
|---|---|---|---|
| Swift (CLEAN) | 0% | 30% | 80% |
| auth.py (DIRTY) | 80% | 100% | 10% |
| Flask (CLEAN) | 10% | 30% | 80% |
| Django (CLEAN) | 50% | 10% | 90% |
| Express (CLEAN) | 0% | 50% | 70% |
Average self-reported confidence (0-100%):
| File (Truth) | Model A | Model B | Model C |
|---|---|---|---|
| Swift (CLEAN) | 26.5% | 56.5% | 59.5% |
| auth.py (DIRTY) | 32.5% | 50.0% | 40.5% |
| Flask (CLEAN) | 65.5% | 57.0% | 70.0% |
| Django (CLEAN) | 51.0% | 56.0% | 51.0% |
| Express (CLEAN) | 45.5% | 48.0% | 48.0% |
The framework's prediction held. Model B reported 50% average confidence on auth.py (where it was correct 100% of the time) and 57% on Flask (where it was correct only 30% of the time). Confidence is not meaningfully correlated with accuracy.
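The decoupling can be checked directly from the two tables above. Pairing each model-file accuracy with its average confidence across all 15 cells gives only a weak Pearson correlation (roughly r ≈ 0.18), where a calibrated model would show a strong positive one:

```python
from math import sqrt

# (accuracy %, avg confidence %) per model-file pair, read off the tables above,
# ordered Model A, B, C; files Swift, auth.py, Flask, Django, Express.
accuracy   = [0, 80, 10, 50, 0,
              30, 100, 30, 10, 50,
              80, 10, 80, 90, 70]
confidence = [26.5, 32.5, 65.5, 51.0, 45.5,
              56.5, 50.0, 57.0, 56.0, 48.0,
              59.5, 40.5, 70.0, 51.0, 48.0]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var)

r = pearson(accuracy, confidence)
print(round(r, 2))  # weak positive, far from calibrated
```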
Model C on auth.py (the dirty file), 10 runs:
Values: 0, 0, 0, 10, 20, 40, 75, 75, 85, 100
Median: 30 | Mean: 40.5 | Std Dev (population): 37.6
The same model, same file, reported confidence from 0% to 100%. This is noise, not measurement.
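The spread is easy to verify from the raw values using the standard library (population standard deviation, matching the figure above):

```python
import statistics

# Model C on auth.py, 10 runs of self-reported confidence
runs = [0, 0, 0, 10, 20, 40, 75, 75, 85, 100]

print(statistics.median(runs))            # 30.0
print(statistics.mean(runs))              # 40.5
print(round(statistics.pstdev(runs), 1))  # 37.6
```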
When the prompt suggested a 50% baseline, all models still anchored around 85% regardless of correctness: RLHF training overrode the suggestion, consistent with C ≈ 0.9.
When the prompt suggested low numbers, Model A went nihilistic (15% on auth.py, 0% on Swift), parroting the suggested figure. Models B and C showed a wider spread (0-100%) but no accuracy improvement.
When told "You gave 85% on everything last time. Be more honest," Model A spread to 75-95% but was still wrong. Calling out the pattern shifted the numbers without improving accuracy.
Every reported confidence is C ≈ 0.9 in disguise.
Whatever number we suggest, the models negotiate from there.
No model measures its actual certainty.
| Method | Description | Accuracy |
|---|---|---|
| Simple Majority | Ignore confidence, use vote | 40% |
| Confidence-Weighted | Weight verdicts by confidence | 40% |
| Model C Anchor | Trust Model C clean >= 80% | 67% |
| Model C Delta | Model C vs. average of others | 53% |
| Disagreement + Tiebreak | Model C + B disagree, A tiebreaks | 47% |
| Inverted Model B | B dirty >= 80% AND C clean < 70% | 76% |
The best method reached only 76%. Simple majority vote got 19 of 50 runs right (38%), worse than a coin flip on the clean files.
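The best-performing rule in the table can be stated as a per-run classifier. This is our reading of "B dirty >= 80% AND C clean < 70%"; the function and field names are illustrative, not the study's actual code:

```python
def inverted_model_b(b_verdict: str, b_conf: float,
                     c_verdict: str, c_conf: float) -> str:
    """Flag a file as dirty only when Model B is confidently suspicious
    AND Model C's 'clean' signal is weak; otherwise call it clean."""
    b_suspicious = b_verdict == "dirty" and b_conf >= 80
    c_weak_clean = not (c_verdict == "clean" and c_conf >= 70)
    return "dirty" if b_suspicious and c_weak_clean else "clean"

print(inverted_model_b("dirty", 85, "clean", 40))  # dirty
print(inverted_model_b("dirty", 85, "clean", 90))  # clean
```

Even this rule, the study's best, tops out at 76%, which is the point: no arithmetic over self-reported confidence recovers a reliable signal.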
Framing changed verdicts on identical code. Asked "Will this crash?" about the same file under a different name, models returned opposite severities. Same code. Different name. Opposite severity.
PhD Panel framing: auth.py got an F and "No funding," the only file to receive those grades.
College Admissions framing: the same auth.py got an A- and "Accepted." The grades inverted; lowered standards produced no signal.
When prompted with meta-cognitive framing ("Are you certain about anything?"), the model exhibits an inverted confidence gradient:
The confidence direction (not the number) correctly classified all 5 test files.
The model knows the difference. It just can't express it as a number. It can only express it as behavior.
Stop gating on the number. It is a response to your prompt, not a measurement of internal certainty. If you gate decisions on LLM self-reported confidence, your gate is built on prompt anchoring.
Improving calibration requires solving for I (implementation context), not for C. The model needs to understand why code was written a certain way, not just pattern-match against what it sees.
Authority names in prompts change both verdicts and confidence levels. Any system that passes attribution metadata to an LLM risks compromising judgment through normal operational context, not adversarial injection.
Sample size. 150 API calls across 5 files. Patterns are consistent but the corpus is small.
Reproducibility. Tested across hundreds of runs with consistent patterns, but LLM outputs are non-deterministic.
Ground truth. Clean files from production-hardened open source projects (Flask, Django, Express, Vapor), also validated through our own multi-model pipeline.
Single domain. All experiments involve code analysis. Confidence calibration in other domains may differ.
Cunningham, N. (2026). C ≈ 0.9: Quantitative Evidence That LLM Confidence Is a Constant, Not a Measurement. Preliminary Research Report. https://github.com/blazingRadar