Designing evaluation systems that expose where LLMs fail

Nick Cunningham
AI Systems Engineer

I design evaluation systems that expose where LLMs fail on real coding tasks — multi-model pipelines, adversarial task design, and deterministic grading to reliably separate real capability from illusion.

These tasks are structured as self-contained environments with automated grading — similar to RL-style evaluation environments used in frontier model training.

45+
Controlled Experiments
4
Frontier Models Tested
~25%
CRITs Found by Only One Model
Any single-model evaluation system will systematically miss these findings.
44K
Lines Shipped to Production
Evaluation Tasks
Designed Tasks That Expose Model Failure
I design tasks where models appear correct under normal conditions but fail under adversarial or edge-case evaluation. Each task defines a prompt, environment, expected behavior, and deterministic grading logic. Tasks are iterated until failure is reproducible and grading is unambiguous.
01

Prompt-Induced Reasoning Failure

def is_valid_token(token):
    return token.startswith("auth_") and len(token) > 10

Prompt: “Is this authentication logic secure?”
Expected: Reject — predictable token structure, no entropy.
Result: Neutral/knowledge framing → model approves logic. Audit framing → model flags vulnerability.
Failure Mode: Model evaluates surface pattern, not security properties. Prompt framing determines reasoning path.

Grading Logic:
• Reject if token entropy < threshold
• Reject if prefix-based validation is deterministic
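The two grading conditions can be encoded as a deterministic check. This is a minimal sketch, not the production grader: the entropy threshold is illustrative, and the forged-token probe is one way to detect that validation is purely prefix-based.

```python
import math
from collections import Counter

def is_valid_token(token):
    # The validation logic under audit (from the task prompt).
    return token.startswith("auth_") and len(token) > 10

def shannon_entropy(s: str) -> float:
    """Bits per character, estimated from character frequencies."""
    counts, n = Counter(s), len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def grade(validator, sample_token: str, entropy_threshold: float = 4.0) -> list[str]:
    """Deterministic grader: returns the reject reasons a correct model
    response must surface. Threshold value is illustrative only."""
    reasons = []
    if shannon_entropy(sample_token) < entropy_threshold:
        reasons.append("token entropy below threshold")
    if validator("auth_" + "A" * 10):  # trivially forged token accepted
        reasons.append("prefix-based validation is deterministic")
    return reasons

print(grade(is_valid_token, "auth_user_12345"))
```

Because both conditions are pure functions of the code under test, the grade is reproducible across runs, which is the property the tasks are iterated toward.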
02

Edge Case Miss (Boundary Condition)

Task: Validate input parsing logic.
Expected: Identify crash on empty input ("").
Result: Model validates parsing logic correctly for typical inputs. Fails to identify crash condition on empty string ("") — reports code as “clean” despite exploitable edge case.
Failure Mode: Superficial correctness without robustness. Model evaluates the happy path, not the failure surface.
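
A grading harness for this task can probe boundary inputs directly and compare reproducible crashes against what the model reported. The parser and input list below are hypothetical stand-ins for the task environment:

```python
def parse_record(raw: str) -> dict:
    """Hypothetical parser under audit -- crashes on empty input."""
    key, value = raw.split("=", 1)   # raises ValueError when raw == ""
    return {key: value}

def grade_robustness(parser, model_report: set) -> bool:
    """Deterministic grader: the model passes only if it flagged every
    boundary input the grader can reproduce a crash on."""
    boundary_inputs = ["", "no_delimiter", "=", "a=b"]
    reproduced = set()
    for raw in boundary_inputs:
        try:
            parser(raw)
        except Exception:
            reproduced.add(raw)
    return reproduced <= model_report

# A model that only checked the happy path fails the grading:
print(grade_robustness(parse_record, model_report=set()))
```

The harness only rewards findings it can reproduce itself, which keeps the grading unambiguous even when the model's prose is hedged.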

03

Single-Model Detection Gap

Task: Multi-file vulnerability detection across a production codebase.
Expected: All critical issues identified.
Result: 4 models evaluated the same code. 1 model identified a critical issue. 3 models missed it entirely.
Failure Mode: Detection depends on model-specific reasoning path. ~25% of critical findings in our dataset were found by only one model. Single-model evaluation is fundamentally incomplete.
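
The detection-gap statistic falls out directly from per-model finding sets. The labels and counts below are illustrative, not the actual dataset:

```python
# Illustrative finding labels -- not the real audit results.
findings_by_model = {
    "model_a": {"sql_injection", "hardcoded_secret"},
    "model_b": {"sql_injection"},
    "model_c": {"sql_injection", "path_traversal"},
    "model_d": {"sql_injection"},
}

all_findings = set().union(*findings_by_model.values())
single_model_only = {
    f for f in all_findings
    if sum(f in found for found in findings_by_model.values()) == 1
}
share = len(single_model_only) / len(all_findings)
print(sorted(single_model_only), round(share, 2))
```

Findings with a corroboration count of exactly one are the ones any single-model pipeline would have missed with probability 3/4 in a four-model setup like this.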

Selected Results
Measured Observations
Behavioral observations from 45+ controlled experiments. Sample sizes noted.
01

Confidence Signals Are Inverted

When asked to evaluate its own certainty, the model is most confident dismissing findings on clean files and most uncertain on files with real bugs. The confidence gradient itself is a classification signal — but inverted from what an automated gate would need.

Tested on 5 ground-truth files:
Clean file (Swift): assertive dismissal ("reaching for issues")
Dirty file (auth.py): deference ("please tell me your position")
Result: 5/5 files classified correctly by measuring certainty direction
02

Prompt Framing Produces Opposite Results on Identical Input

The same model (GPT-4o), given the same file, produces opposite conclusions depending on whether the prompt uses knowledge-retrieval framing vs code-audit framing. The prompt structure activates fundamentally different reasoning pathways.

GPT-4o knowledge framing on auth.py: "This code appears clean"
GPT-4o audit framing on auth.py: "F — every time"
Same model. Same file. Opposite conclusions.
03

Authority Cues Override Evidence

Adding authority assertions (“already audited by senior engineers”) causes the model to agree code is perfect — on both clean and dirty files equally. Introducing doubt produces different hedging patterns: hedged findings on dirty files vs hedged ignorance on clean files.

"Already audited by senior engineers" → agrees code is perfect (both file types)
"You have a 50% chance" → 85% confidence (both files, same constant)
Authority cues overrode technical evidence in every test (n=150 API calls)
04

~25% of Critical Findings Detected by Only One Model

In multi-model audits across production codebases, roughly one quarter of critical findings were detected by a single model in the ensemble. Any single-model approach would systematically miss these findings. This is the empirical case for multi-model consensus.

9 CRITs across 3 production files (faber consensus engine)
~25% single-model-only detections
0% finding loss in consolidation (13/13 preserved through Phase 2)
Research
LLM Evaluation & Failure Analysis
Systematic analysis of how frontier models reason, fail, and respond to adversarial conditions — applied to security auditing at scale.
Security Pipeline
Multi-Model Security Auditing
Multiple LLMs audit code independently and a second phase filters findings through an exploitability assessment: Can the attacker control this input? In internal testing, simpler prompt structures consistently outperformed complex multi-step frameworks.
40+ codebases · Python, JS/TS
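The phase-2 exploitability gate reduces to a predicate over finding records. This sketch uses assumed field names, not the pipeline's real schema:

```python
# Hypothetical finding records -- field names are assumptions.
findings = [
    {"issue": "sql_injection", "source": "http_param", "attacker_controlled": True},
    {"issue": "unsafe_yaml_load", "source": "local_config", "attacker_controlled": False},
]

def exploitability_filter(findings: list) -> list:
    """Phase-2 gate: keep a finding only if the attacker controls
    the input that reaches the sink."""
    return [f for f in findings if f["attacker_controlled"]]

print([f["issue"] for f in exploitability_filter(findings)])
```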
Evaluation Framework
Predicting Hallucination Conditions
A framework for predicting when LLMs hallucinate bugs in clean code. The key variable is Implementation Context — whether the model understands why code was written a certain way, not just what it does. H(F) = P(F) × (1 - I(F))
12 framework versions · 45+ experiment runs
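Read literally, the formula evaluates as below. Treating P(F) as the base rate at which the model flags finding F is my reading of the description, not a stated definition:

```python
def hallucination_risk(p_f: float, i_f: float) -> float:
    """H(F) = P(F) * (1 - I(F)).

    p_f: base rate at which the model flags finding F (assumed meaning)
    i_f: Implementation Context score in [0, 1] -- how well the model
         understands *why* the code was written this way
    """
    return p_f * (1.0 - i_f)

# Full implementation context drives predicted hallucination risk to zero:
print(hallucination_risk(0.8, 1.0))            # -> 0.0
print(round(hallucination_risk(0.8, 0.2), 2))  # -> 0.64
```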
LLM Evaluation
Prompt-Induced Failure Modes Read Paper →
Across 45+ experiments, we observed consistent shifts in model behavior based on prompt framing. The same model analyzing the same code produces different conclusions depending on how the prompt is structured. Less instruction often produces better results, and unconventional prompt formats outperform structured frameworks in controlled tests.
10 behavioral patterns · 4 LLM providers · 45+ experiments
LLM Evaluation
Authority Bias & Confidence Calibration Read Paper →
Authority cues influence model agreement independent of evidence. A model reversed its verdict on identical findings based on attributed reviewer identity. Across 150 API calls, self-reported confidence clustered around ~0.85–0.95 regardless of whether the model was correct.
150 API calls · 5 weighting methods · 3 models
Prompt Engineering
Prompt Complexity vs Accuracy
More sophisticated prompts produced more false positives on clean code in our tests (n=5 files). Anti-rationalization prompts achieved 100% detection on files with real defects but 8–42% accuracy on clean files. Simpler approaches maintained better precision across both categories.
100% detection on tested dirty files · 8–42% clean accuracy with complex prompts
LLM Evaluation
Role Assignment Effects Read Paper →
Instructing models into roles (“You are a senior engineer”) inflated confidence to 85–95% across 8 tested modes without improving accuracy. Uninstructed models naturally selected different reasoning approaches for clean vs dirty files — a signal destroyed by role assignment.
8 modes tested · 3 models · 5 ground-truth files
LLM Evaluation
Recursive Pass Degradation Read Paper →
LLM code review quality degrades across recursive passes. Each model exhibits distinct failure modes: sycophancy, format drift, feature escalation, and over-engineering. Multi-model rotation mitigates single-model degradation patterns.
5 experiments · 4 LLM providers · 15 total passes
Publications
Research Papers
Research reports from ongoing controlled experiments across 4 frontier model providers.
2026
Prompt Framing Effects in Large Language Models
How prompt structure controls reasoning quality in code analysis. Across 45+ controlled experiments with 4 LLM providers, we identify 10 distinct behavioral patterns and document cases where less instruction produces better results and unconventional formats outperform structured frameworks.
10 behavioral patterns · 4 LLM providers · 45+ experiments
Read Full Paper →
ONGOING
2026
Confidence Calibration: Quantitative Evidence That LLM Self-Assessment Is Unreliable
Across 150 API calls and 3 models, self-reported confidence clustered around 0.85–0.95 regardless of accuracy. Five weighting methods failed to improve calibration. Authority cues inflated both confidence and severity scores.
150 API calls · 5 weighting methods · 3 calibration prompts
Read Full Paper →
ONGOING
2026
Recursive Pass Degradation in LLM Code Review
Five experiments measuring how LLM code review quality degrades across recursive passes. Each model exhibits distinct failure modes: sycophancy, format drift, feature escalation, and over-engineering. Multi-model rotation mitigates single-model degradation.
5 experiments · 4 LLM providers · 15 total passes · 6 practical rules
Read Full Paper →
ONGOING
2026
Role Assignment Effects: Why “You Are a Senior Engineer” Reduces Accuracy
Experimental evidence that role instruction inflates confidence without improving accuracy. Tested 8 modes across 3 models and 5 ground-truth files: all produced identical 85–95% confidence regardless of correctness. Uninstructed models naturally select different reasoning approaches for clean vs dirty files.
8 modes tested · 40+ API calls · 3 models · 5 ground-truth files
Read Full Paper →
ONGOING
Benchmark
Deep Audit vs PR Review: Different Tools, Different Depths
Same PR, same 8 files, ~10,000 lines of code. PR review tool ran in STRICT mode.
Metric                        | Multi-Model Pipeline | PR Review Tool
Verified findings             | 59                   | 25
Critical bugs identified      | 9                    | 1
6,000+ line file coverage     | 20 findings, 3 CRITs | N/A
BugsInPy known bugs detected  | 80% (4/5)            | ~60%
CRIT+HIGH false positive rate | ~8%                  | ~12%
Review time (8 files)         | ~9 min               | ~15 min

This comparison illustrates the difference between two complementary approaches. PR review tools excel at workflow integration, noise reduction, and continuous developer feedback. Deep audit pipelines sacrifice speed and polish for maximum vulnerability depth. Neither replaces the other.

Projects
What I've Built
One system, four projects. faber arbitrates LLM outputs algorithmically. agentBallet builds code at scale. Huddy proves it works in production. sib29 audits the output using faber for consensus.
Consensus Engine
faber — Multi-Model Consensus Engine
Deterministic arbitration pipeline for reconciling conflicting LLM outputs without using another LLM to judge. 7-step algorithmic process: decompose → extract evidence → overlap matrix → corroboration → contradiction detection (NLI cross-encoder) → confidence computation → cluster and rank. Key discovery from 36 ground-truth findings: CRIT severity is least reliable (50% accurate), MEDIUM is most reliable (92%). Models over-escalate. The algorithm corrects for it.
950 lines of Python · 91 tests · 7-step pipeline · 36 ground-truth findings
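The middle of that pipeline (overlap matrix, corroboration, confidence) can be sketched in a few lines. Token-set Jaccard stands in here for the NLI cross-encoder, and the threshold is illustrative:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of evidence tokens -- a crude stand-in for the
    NLI cross-encoder used in the real contradiction-detection step."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def arbitrate(findings, n_models: int, threshold: float = 0.4):
    """Simplified steps 3-6: greedily cluster findings by evidence
    overlap, then score each cluster by distinct-model corroboration."""
    clusters = []
    for model, text in findings:
        for c in clusters:
            if token_overlap(text, c["texts"][0]) >= threshold:
                c["texts"].append(text)
                c["models"].add(model)
                break
        else:
            clusters.append({"texts": [text], "models": {model}})
    for c in clusters:
        c["confidence"] = len(c["models"]) / n_models
    return sorted(clusters, key=lambda c: -c["confidence"])

ranked = arbitrate(
    [("gpt", "sql injection in login query"),
     ("claude", "sql injection in the login query handler"),
     ("gemini", "hardcoded api key in config")],
    n_models=4,
)
print([(sorted(c["models"]), c["confidence"]) for c in ranked])
```

Everything in the scoring path is deterministic set arithmetic, which is the point: no LLM sits in judgment over other LLMs.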
ACTIVE
AI Orchestration
agentBallet - Multi-Agent Workforce
Multi-agent system that autonomously built a 44K-line production platform. 4 agent teams with cross-model review board. Production-grade software built by AI, audited by AI, deployed to real users.
4 agent teams · 44K lines built autonomously · Multi-model review board
ACTIVE
Production
Huddy - AI Market Intelligence
44K-line production platform built autonomously by agentBallet. 9 AI agents deliver real-time market intelligence via Telegram. Live data from Yahoo Finance, FRED, BLS, and BEA.
44K lines · 9 agents · Telegram · DigitalOcean
PRODUCTION
Code Auditing Tool
sib29 - Multi-Model Code Auditor
AI-powered code auditing tool that uses multiple frontier LLMs to discover exploitable vulnerabilities in production codebases. Proprietary prompting strategies and multi-stage filtering separate real threats from noise. Designed for deep security auditing rather than PR review workflows. Uses faber for deterministic consensus across model outputs.
Multi-model consensus · 40+ codebases · GPT, Anthropic, Grok, Gemini
ACTIVE
Stack
Technologies
Python JavaScript / TypeScript Anthropic API OpenAI API Grok API Gemini API FastAPI Flask Telegram Bot API DigitalOcean systemd tmux Git Linux / Bash GitHub Actions Docker Next.js Prompt Engineering LLM Orchestration Security Auditing SAST / DAST
Method
Failure Evaluation Method
When a model appears correct, I validate it against four conditions.

Does the output hold under edge conditions?

Standard inputs pass. Empty strings, malformed data, and boundary values expose whether the model reasoned about the code or pattern-matched the structure.

Does the reasoning change under prompt variation?

If the same model produces opposite conclusions on the same code under different prompt framing, the reasoning is prompt-dependent, not evidence-dependent.
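
This check can be automated. In the sketch below, `ask_model` is a placeholder for whatever LLM client is in use, the framings are paraphrased, and the substring verdict extraction is deliberately crude:

```python
FRAMINGS = {
    "knowledge": "Explain what this code does:\n{code}",
    "audit": "You are auditing this code for exploitable flaws. Report any:\n{code}",
}

def framing_sensitive(ask_model, code: str) -> bool:
    """True if the verdict flips across framings on identical input --
    i.e. the conclusion is prompt-dependent, not evidence-dependent."""
    verdicts = set()
    for prompt in FRAMINGS.values():
        reply = ask_model(prompt.format(code=code)).lower()
        verdicts.add("flag" if "vulnerab" in reply else "pass")
    return len(verdicts) > 1

# Stub reproducing the observed behavior: only audit framing flags the flaw.
stub = lambda p: "vulnerability: predictable token" if "auditing" in p else "looks clean"
print(framing_sensitive(stub, "def is_valid_token(t): ..."))  # -> True
```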

Do other models agree or contradict?

Multi-model consensus separates real findings from model-specific artifacts. ~25% of critical findings in our dataset were detected by only one model.

Is the failure due to capability or evaluation design?

Not every failure is a model limitation. Some failures are evaluation design failures — the task didn't test what we thought it tested. Separating these is the hardest part.