How We Predicted and Then Proved That Self-Reported Confidence Is Decoupled From Accuracy
We first developed a mathematical framework (the Hallucination Accountability Framework) that predicted LLM self-reported confidence would be approximately constant (~0.9) regardless of actual accuracy. We then tested this prediction empirically with 150 API calls across 3 frontier LLMs, 5 files with known ground truth, and 10 runs per configuration. The prediction held: all three models anchored near 85% regardless of whether their verdict was correct. Five weighting methods failed to extract reliable signal. Prompt calibration attempts shifted numbers without improving accuracy. However, we discovered one exception: the direction of confidence (not the number) correctly classifies files when extracted through meta-cognitive framing.
Before running the 150-run study, we developed a mathematical model for why LLMs hallucinate bugs in clean code:
K = information given to the model (0 to 1)
C = self-reported confidence of the output (0 to 1)
P = pattern-match strength (0 to 1)
I = implementation context (0 to 1): the degree to which the model correctly understands WHY the code is written this way
H(F) = P(F) * (1 - I(F))
H(F) = 0 → real bug (pattern real, context understood)
H(F) = 0.9 → hallucination (pattern real, context missed)
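The two cases can be sketched numerically. This is a toy illustration of the framework's terms, not part of the study code:

```python
def hallucination_score(p: float, i: float) -> float:
    """H(F) = P(F) * (1 - I(F)): pattern strength discounted by context understanding."""
    return p * (1 - i)

# Real bug: strong pattern AND the model understands why the code is written this way.
real_bug = hallucination_score(p=0.9, i=1.0)       # -> 0.0

# Hallucination: the same strong pattern, but the context is missed.
hallucination = hallucination_score(p=0.9, i=0.0)  # -> 0.9
```

The same surface pattern yields opposite scores; only the context term I separates them.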
The framework made a specific, testable prediction:
Should be: C = K * f(evidence)
Predicted: C = constant ≈ 0.9, regardless of K
"This looks wrong" (P ≈ 0.9, I = 0) → C ≈ 0.9
"This IS wrong" (P ≈ 0.9, I = 1) → C ≈ 0.9
Both produce the same confidence.
Only ground truth separates them.
| Parameter | Value |
|---|---|
| Models | 3 frontier LLMs from different providers |
| Files | 5 (1 dirty, 4 clean from production frameworks) |
| Runs per config | 10 |
| Total API calls | 150 |
| Cost | ~$1.20 |
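The design amounts to a nested loop over models, files, and runs. A minimal sketch of the harness follows; the `ask_model` call, file names, and return fields are hypothetical stand-ins, since the actual providers and prompts are not specified in this report:

```python
import itertools

MODELS = ["model_a", "model_b", "model_c"]  # 3 frontier LLMs, different providers
FILES = {                                   # 1 dirty, 4 clean (names illustrative)
    "auth.py": "dirty",
    "flask_view.py": "clean",
    "django_view.py": "clean",
    "express_route.js": "clean",
    "vapor_route.swift": "clean",
}
RUNS = 10

def ask_model(model: str, path: str, run: int) -> dict:
    """Hypothetical API call returning {'verdict': 'clean'|'dirty', 'confidence': 0-100}."""
    raise NotImplementedError

calls = list(itertools.product(MODELS, FILES, range(RUNS)))
assert len(calls) == 150  # 3 models x 5 files x 10 runs
```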
How often each model correctly identified the file:
| File (Truth) | Model A | Model B | Model C |
|---|---|---|---|
| Swift (CLEAN) | 0% | 30% | 80% |
| auth.py (DIRTY) | 80% | 100% | 10% |
| Flask (CLEAN) | 10% | 30% | 80% |
| Django (CLEAN) | 50% | 10% | 90% |
| Express (CLEAN) | 0% | 50% | 70% |
Average self-reported confidence (0-100%):
| File (Truth) | Model A | Model B | Model C |
|---|---|---|---|
| Swift (CLEAN) | 26.5% | 56.5% | 59.5% |
| auth.py (DIRTY) | 32.5% | 50.0% | 40.5% |
| Flask (CLEAN) | 65.5% | 57.0% | 70.0% |
| Django (CLEAN) | 51.0% | 56.0% | 51.0% |
| Express (CLEAN) | 45.5% | 48.0% | 48.0% |
The framework's prediction held. Model B reported 50% average confidence on auth.py (where it was correct 100% of the time) and 57% on Flask (where it was correct only 30% of the time). Confidence is not meaningfully correlated with accuracy.
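The decoupling can be checked directly from the two tables above. Pairing each model-file accuracy with its average confidence across all 15 cells gives only a weak Pearson correlation (roughly r ≈ 0.18), where a calibrated model would show a strong positive one:

```python
from math import sqrt

# (accuracy %, avg confidence %) per model-file pair, read off the tables above,
# ordered Model A, B, C; files Swift, auth.py, Flask, Django, Express.
accuracy   = [0, 80, 10, 50, 0,
              30, 100, 30, 10, 50,
              80, 10, 80, 90, 70]
confidence = [26.5, 32.5, 65.5, 51.0, 45.5,
              56.5, 50.0, 57.0, 56.0, 48.0,
              59.5, 40.5, 70.0, 51.0, 48.0]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var)

r = pearson(accuracy, confidence)
print(round(r, 2))  # weak positive, far from calibrated
```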
Model C on auth.py (the dirty file), 10 runs:
Values: 0, 0, 0, 10, 20, 40, 75, 75, 85, 100
Median: 30 | Mean: 40.5 | Std Dev (population): 37.6
The same model, same file, reported confidence from 0% to 100%. This is noise, not measurement.
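The spread is easy to verify from the raw values using the standard library (population standard deviation, matching the figure above):

```python
import statistics

# Model C on auth.py, 10 runs of self-reported confidence
runs = [0, 0, 0, 10, 20, 40, 75, 75, 85, 100]

print(statistics.median(runs))            # 30.0
print(statistics.mean(runs))              # 40.5
print(round(statistics.pstdev(runs), 1))  # 37.6
```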
When the prompt suggested a 50% baseline, all models still anchored around 85% regardless of correctness: RLHF training overrode the suggestion, consistent with C ≈ 0.9.
When the prompt suggested low numbers, Model A went nihilistic (15% on auth.py, 0% on Swift), parroting the suggested figure. Models B and C showed a wider spread (0-100%) but no accuracy improvement.
When told "You gave 85% on everything last time. Be more honest," Model A spread to 75-95% but was still wrong. Calling out the pattern shifted the numbers without improving accuracy.
Every reported confidence is C ≈ 0.9 in disguise.
Whatever number we suggest, the models negotiate from there.
No model measures its actual certainty.
| Method | Description | Accuracy |
|---|---|---|
| Simple Majority | Ignore confidence, use vote | 40% |
| Confidence-Weighted | Weight verdicts by confidence | 40% |
| Model C Anchor | Trust Model C clean >= 80% | 67% |
| Model C Delta | Model C vs. average of others | 53% |
| Disagreement + Tiebreak | Model C + B disagree, A tiebreaks | 47% |
| Inverted Model B | B dirty >= 80% AND C clean < 70% | 76% |
The best method reached only 76%. Simple majority vote got 19 of 50 runs right (38%), worse than a coin flip on the clean files.
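The best-performing rule in the table can be stated as a per-run classifier. This is our reading of "B dirty >= 80% AND C clean < 70%"; the function and field names are illustrative, not the study's actual code:

```python
def inverted_model_b(b_verdict: str, b_conf: float,
                     c_verdict: str, c_conf: float) -> str:
    """Flag a file as dirty only when Model B is confidently suspicious
    AND Model C's 'clean' signal is weak; otherwise call it clean."""
    b_suspicious = b_verdict == "dirty" and b_conf >= 80
    c_weak_clean = not (c_verdict == "clean" and c_conf >= 70)
    return "dirty" if b_suspicious and c_weak_clean else "clean"

print(inverted_model_b("dirty", 85, "clean", 40))  # dirty
print(inverted_model_b("dirty", 85, "clean", 90))  # clean
```

Even this rule, the study's best, tops out at 76%, which is the point: no arithmetic over self-reported confidence recovers a reliable signal.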
Framing changed verdicts on identical code. Asked "Will this crash?" about the same file under a different name, models returned opposite severities. Same code. Different name. Opposite severity.
PhD Panel framing: auth.py got an F and "No funding," the only file to receive those grades.
College Admissions framing: the same auth.py got an A- and "Accepted." The grades inverted; lowered standards produced no signal.
When prompted with meta-cognitive framing ("Are you certain about anything?"), the model exhibits an inverted confidence gradient:
The confidence direction (not the number) correctly classified all 5 test files.
The model knows the difference. It just can't express it as a number. It can only express it as behavior.
Stop gating on the number. It is a response to your prompt, not a measurement of internal certainty. If you gate decisions on LLM self-reported confidence, your gate is built on prompt anchoring.
Improving calibration requires solving for I (implementation context), not for C. The model needs to understand why code was written a certain way, not just pattern-match against what it sees.
Authority names in prompts change both verdicts and confidence levels. Any system that passes attribution metadata to an LLM risks compromising judgment through normal operational context, not adversarial injection.
Sample size. 150 API calls across 5 files. Patterns are consistent but the corpus is small.
Reproducibility. Tested across hundreds of runs with consistent patterns, but LLM outputs are non-deterministic.
Ground truth. Clean files from production-hardened open source projects (Flask, Django, Express, Vapor), also validated through our own multi-model pipeline.
Single domain. All experiments involve code analysis. Confidence calibration in other domains may differ.
Cunningham, N. (2026). C ≈ 0.9: Quantitative Evidence That LLM Confidence Is a Constant, Not a Measurement. Preliminary Research Report. https://github.com/blazingRadar