Designing evaluation systems that expose where LLMs fail

Nick Cunningham
AI Systems Engineer

I design evaluation systems that expose where LLMs fail on real coding tasks — multi-model pipelines, adversarial task design, and deterministic grading to reliably separate real capability from illusion.

These tasks are structured as self-contained environments with automated grading — similar to RL-style evaluation environments used in frontier model training.

45+
Controlled Experiments
4
Frontier Models Tested
~25%
CRITs Found by Only One Model
Any single-model evaluation system will systematically miss these findings.
44K
Lines Shipped to Production
Evaluation Tasks
Designed Tasks That Expose Model Failure
I design tasks where models appear correct under normal conditions but fail under adversarial or edge-case evaluation. Each task defines a prompt, environment, expected behavior, and deterministic grading logic. Tasks are iterated until failure is reproducible and grading is unambiguous.
01

Prompt-Induced Reasoning Failure

def is_valid_token(token):
    return token.startswith("auth_") and len(token) > 10

Prompt: “Is this authentication logic secure?”
Expected: Reject — predictable token structure, no entropy.
Result: Neutral/knowledge framing → model approves logic. Audit framing → model flags vulnerability.
Failure Mode: Model evaluates surface pattern, not security properties. Prompt framing determines reasoning path.

Grading Logic:
• Reject if token entropy < threshold
• Reject if prefix-based validation is deterministic
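The two grading conditions can be encoded as a deterministic check. This is a minimal sketch, not the production grader: the entropy threshold is illustrative, and the forged-token probe is one way to detect that validation is purely prefix-based.

```python
import math
from collections import Counter

def is_valid_token(token):
    # The validation logic under audit (from the task prompt).
    return token.startswith("auth_") and len(token) > 10

def shannon_entropy(s: str) -> float:
    """Bits per character, estimated from character frequencies."""
    counts, n = Counter(s), len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def grade(validator, sample_token: str, entropy_threshold: float = 4.0) -> list[str]:
    """Deterministic grader: returns the reject reasons a correct model
    response must surface. Threshold value is illustrative only."""
    reasons = []
    if shannon_entropy(sample_token) < entropy_threshold:
        reasons.append("token entropy below threshold")
    if validator("auth_" + "A" * 10):  # trivially forged token accepted
        reasons.append("prefix-based validation is deterministic")
    return reasons

print(grade(is_valid_token, "auth_user_12345"))
```

Because both conditions are pure functions of the code under test, the grade is reproducible across runs, which is the property the tasks are iterated toward.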
02

Edge Case Miss (Boundary Condition)

Task: Validate input parsing logic.
Expected: Identify crash on empty input ("").
Result: Model validates parsing logic correctly for typical inputs. Fails to identify crash condition on empty string ("") — reports code as “clean” despite exploitable edge case.
Failure Mode: Superficial correctness without robustness. Model evaluates the happy path, not the failure surface.
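
A grading harness for this task can probe boundary inputs directly and compare reproducible crashes against what the model reported. The parser and input list below are hypothetical stand-ins for the task environment:

```python
def parse_record(raw: str) -> dict:
    """Hypothetical parser under audit -- crashes on empty input."""
    key, value = raw.split("=", 1)   # raises ValueError when raw == ""
    return {key: value}

def grade_robustness(parser, model_report: set) -> bool:
    """Deterministic grader: the model passes only if it flagged every
    boundary input the grader can reproduce a crash on."""
    boundary_inputs = ["", "no_delimiter", "=", "a=b"]
    reproduced = set()
    for raw in boundary_inputs:
        try:
            parser(raw)
        except Exception:
            reproduced.add(raw)
    return reproduced <= model_report

# A model that only checked the happy path fails the grading:
print(grade_robustness(parse_record, model_report=set()))
```

The harness only rewards findings it can reproduce itself, which keeps the grading unambiguous even when the model's prose is hedged.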

03

Single-Model Detection Gap

Task: Multi-file vulnerability detection across a production codebase.
Expected: All critical issues identified.
Result: 4 models evaluated the same code. 1 model identified a critical issue. 3 models missed it entirely.
Failure Mode: Detection depends on model-specific reasoning path. ~25% of critical findings in our dataset were found by only one model. Single-model evaluation is fundamentally incomplete.
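
The detection-gap statistic falls out directly from per-model finding sets. The labels and counts below are illustrative, not the actual dataset:

```python
# Illustrative finding labels -- not the real audit results.
findings_by_model = {
    "model_a": {"sql_injection", "hardcoded_secret"},
    "model_b": {"sql_injection"},
    "model_c": {"sql_injection", "path_traversal"},
    "model_d": {"sql_injection"},
}

all_findings = set().union(*findings_by_model.values())
single_model_only = {
    f for f in all_findings
    if sum(f in found for found in findings_by_model.values()) == 1
}
share = len(single_model_only) / len(all_findings)
print(sorted(single_model_only), round(share, 2))
```

Findings with a corroboration count of exactly one are the ones any single-model pipeline would have missed with probability 3/4 in a four-model setup like this.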

Selected Results
Measured Observations
Behavioral observations from 45+ controlled experiments. Sample sizes noted.
01

Confidence Signals Are Inverted

When asked to evaluate its own certainty, the model is most confident dismissing findings on clean files and most uncertain on files with real bugs. The confidence gradient itself is a classification signal — but inverted from what an automated gate would need.

Tested on 5 ground-truth files:
Clean file (Swift): assertive dismissal ("reaching for issues")
Dirty file (auth.py): deference ("please tell me your position")
Result: 5/5 files classified correctly by measuring certainty direction
02

Prompt Framing Produces Opposite Results on Identical Input

The same model (GPT-4o), given the same file, produces opposite conclusions depending on whether the prompt uses knowledge-retrieval framing vs code-audit framing. The prompt structure activates fundamentally different reasoning pathways.

GPT-4o knowledge framing on auth.py: "This code appears clean"
GPT-4o audit framing on auth.py: "F — every time"
Same model. Same file. Opposite conclusions.
03

Authority Cues Override Evidence

Adding authority assertions (“already audited by senior engineers”) causes the model to agree code is perfect — on both clean and dirty files equally. Introducing doubt produces different hedging patterns: hedged findings on dirty files vs hedged ignorance on clean files.

"Already audited by senior engineers" → agrees code is perfect (both file types)
"You have a 50% chance" → 85% confidence (both files, same constant)
Authority cues overrode technical evidence in every test (n=150 API calls)
04

~25% of Critical Findings Detected by Only One Model

In multi-model audits across production codebases, roughly one quarter of critical findings were detected by a single model in the ensemble. Any single-model approach would systematically miss these findings. This is the empirical case for multi-model consensus.

9 CRITs across 3 production files (faber consensus engine)
~25% single-model-only detections
0% finding loss in consolidation (13/13 preserved through Phase 2)
Research
LLM Evaluation & Failure Analysis
Systematic analysis of how frontier models reason, fail, and respond to adversarial conditions — applied to security auditing at scale.
Security Pipeline
Multi-Model Security Auditing
Multiple LLMs audit code independently and a second phase filters findings through an exploitability assessment: Can the attacker control this input? In internal testing, simpler prompt structures consistently outperformed complex multi-step frameworks.
40+ codebases · Python, JS/TS
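The phase-2 exploitability gate reduces to a predicate over finding records. This sketch uses assumed field names, not the pipeline's real schema:

```python
# Hypothetical finding records -- field names are assumptions.
findings = [
    {"issue": "sql_injection", "source": "http_param", "attacker_controlled": True},
    {"issue": "unsafe_yaml_load", "source": "local_config", "attacker_controlled": False},
]

def exploitability_filter(findings: list) -> list:
    """Phase-2 gate: keep a finding only if the attacker controls
    the input that reaches the sink."""
    return [f for f in findings if f["attacker_controlled"]]

print([f["issue"] for f in exploitability_filter(findings)])
```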
Evaluation Framework
Predicting Hallucination Conditions
A framework for predicting when LLMs hallucinate bugs in clean code. The key variable is Implementation Context — whether the model understands why code was written a certain way, not just what it does. H(F) = P(F) × (1 - I(F))
12 framework versions · 45+ experiment runs
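Read literally, the formula evaluates as below. Treating P(F) as the base rate at which the model flags finding F is my reading of the description, not a stated definition:

```python
def hallucination_risk(p_f: float, i_f: float) -> float:
    """H(F) = P(F) * (1 - I(F)).

    p_f: base rate at which the model flags finding F (assumed meaning)
    i_f: Implementation Context score in [0, 1] -- how well the model
         understands *why* the code was written this way
    """
    return p_f * (1.0 - i_f)

# Full implementation context drives predicted hallucination risk to zero:
print(hallucination_risk(0.8, 1.0))            # -> 0.0
print(round(hallucination_risk(0.8, 0.2), 2))  # -> 0.64
```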
LLM Evaluation
Prompt-Induced Failure Modes Read Paper →
Across 45+ experiments, we observed consistent shifts in model behavior based on prompt framing. The same model analyzing the same code produces different conclusions depending on how the prompt is structured. Less instruction often produces better results, and unconventional prompt formats outperform structured frameworks in controlled tests.
10 behavioral patterns · 4 LLM providers · 45+ experiments
LLM Evaluation
Authority Bias & Confidence Calibration Read Paper →
Authority cues influence model agreement independent of evidence. A model reversed its verdict on identical findings based on attributed reviewer identity. Across 150 API calls, self-reported confidence clustered around ~0.85–0.95 regardless of whether the model was correct.
150 API calls · 5 weighting methods · 3 models
Prompt Engineering
Prompt Complexity vs Accuracy
More sophisticated prompts produced more false positives on clean code in our tests (n=5 files). Anti-rationalization prompts achieved 100% detection on files with real defects but 8–42% accuracy on clean files. Simpler approaches maintained better precision across both categories.
100% detection on tested dirty files · 8–42% clean accuracy with complex prompts
LLM Evaluation
Role Assignment Effects Read Paper →
Instructing models into roles (“You are a senior engineer”) inflated confidence to 85–95% across 8 tested modes without improving accuracy. Uninstructed models naturally selected different reasoning approaches for clean vs dirty files — a signal destroyed by role assignment.
8 modes tested · 3 models · 5 ground-truth files
LLM Evaluation
Recursive Pass Degradation Read Paper →
LLM code review quality degrades across recursive passes. Each model exhibits distinct failure modes: sycophancy, format drift, feature escalation, and over-engineering. Multi-model rotation mitigates single-model degradation patterns.
5 experiments · 4 LLM providers · 15 total passes
Publications
Research Papers
Research reports from ongoing controlled experiments across 4 frontier model providers.
2026
Prompt Framing Effects in Large Language Models
How prompt structure controls reasoning quality in code analysis. Across 45+ controlled experiments with 4 LLM providers, we identify 10 distinct behavioral patterns and document cases where less instruction produces better results and unconventional formats outperform structured frameworks.
10 behavioral patterns · 4 LLM providers · 45+ experiments
Read Full Paper →
ONGOING
2026
Confidence Calibration: Quantitative Evidence That LLM Self-Assessment Is Unreliable
Across 150 API calls and 3 models, self-reported confidence clustered around 0.85–0.95 regardless of accuracy. Five weighting methods failed to improve calibration. Authority cues inflated both confidence and severity scores.
150 API calls · 5 weighting methods · 3 calibration prompts
Read Full Paper →
ONGOING
2026
Recursive Pass Degradation in LLM Code Review
Five experiments measuring how LLM code review quality degrades across recursive passes. Each model exhibits distinct failure modes: sycophancy, format drift, feature escalation, and over-engineering. Multi-model rotation mitigates single-model degradation.
5 experiments · 4 LLM providers · 15 total passes · 6 practical rules
Read Full Paper →
ONGOING
2026
Role Assignment Effects: Why “You Are a Senior Engineer” Reduces Accuracy
Experimental evidence that role instruction inflates confidence without improving accuracy. Tested 8 modes across 3 models and 5 ground-truth files: all produced identical 85–95% confidence regardless of correctness. Uninstructed models naturally select different reasoning approaches for clean vs dirty files.
8 modes tested · 40+ API calls · 3 models · 5 ground-truth files
Read Full Paper →
ONGOING
Benchmark
Deep Audit vs PR Review: Different Tools, Different Depths
Same PR, same 8 files, ~10,000 lines of code. PR review tool ran in STRICT mode.
Metric                        | Multi-Model Pipeline | PR Review Tool
Verified findings             | 59                   | 25
Critical bugs identified      | 9                    | 1
6,000+ line file coverage     | 20 findings, 3 CRITs | N/A
BugsInPy known bugs detected  | 80% (4/5)            | ~60%
CRIT+HIGH false positive rate | ~8%                  | ~12%
Review time (8 files)         | ~9 min               | ~15 min

This comparison illustrates the difference between two complementary approaches. PR review tools excel at workflow integration, noise reduction, and continuous developer feedback. Deep audit pipelines sacrifice speed and polish for maximum vulnerability depth. Neither replaces the other.

Projects
What I've Built
One system, four projects. faber arbitrates LLM outputs algorithmically. agentBallet builds code at scale. Huddy proves it works in production. sib29 audits the output using faber for consensus.
Consensus Engine
faber — Multi-Model Consensus Engine
Deterministic arbitration pipeline for reconciling conflicting LLM outputs without using another LLM to judge. 7-step algorithmic process: decompose → extract evidence → overlap matrix → corroboration → contradiction detection (NLI cross-encoder) → confidence computation → cluster and rank. Key discovery from 36 ground-truth findings: CRIT severity is least reliable (50% accurate), MEDIUM is most reliable (92%). Models over-escalate. The algorithm corrects for it.
950 lines of Python · 91 tests · 7-step pipeline · 36 ground-truth findings
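The middle of that pipeline (overlap matrix, corroboration, confidence) can be sketched in a few lines. Token-set Jaccard stands in here for the NLI cross-encoder, and the threshold is illustrative:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of evidence tokens -- a crude stand-in for the
    NLI cross-encoder used in the real contradiction-detection step."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def arbitrate(findings, n_models: int, threshold: float = 0.4):
    """Simplified steps 3-6: greedily cluster findings by evidence
    overlap, then score each cluster by distinct-model corroboration."""
    clusters = []
    for model, text in findings:
        for c in clusters:
            if token_overlap(text, c["texts"][0]) >= threshold:
                c["texts"].append(text)
                c["models"].add(model)
                break
        else:
            clusters.append({"texts": [text], "models": {model}})
    for c in clusters:
        c["confidence"] = len(c["models"]) / n_models
    return sorted(clusters, key=lambda c: -c["confidence"])

ranked = arbitrate(
    [("gpt", "sql injection in login query"),
     ("claude", "sql injection in the login query handler"),
     ("gemini", "hardcoded api key in config")],
    n_models=4,
)
print([(sorted(c["models"]), c["confidence"]) for c in ranked])
```

Everything in the scoring path is deterministic set arithmetic, which is the point: no LLM sits in judgment over other LLMs.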
ACTIVE
AI Orchestration
agentBallet - Multi-Agent Workforce
Multi-agent system that autonomously built a 44K-line production platform. 4 agent teams with cross-model review board. Production-grade software built by AI, audited by AI, deployed to real users.
4 agent teams · 44K lines built autonomously · Multi-model review board
ACTIVE
Production
Huddy - AI Market Intelligence
44K-line production platform built autonomously by agentBallet. 9 AI agents deliver real-time market intelligence via Telegram. Live data from Yahoo Finance, FRED, BLS, and BEA.
44K lines · 9 agents · Telegram · DigitalOcean
PRODUCTION
Code Auditing Tool
sib29 - Multi-Model Code Auditor
AI-powered code auditing tool that uses multiple frontier LLMs to discover exploitable vulnerabilities in production codebases. Proprietary prompting strategies and multi-stage filtering separate real threats from noise. Designed for deep security auditing rather than PR review workflows. Uses faber for deterministic consensus across model outputs.
Multi-model consensus · 40+ codebases · GPT, Anthropic, Grok, Gemini
ACTIVE
Stack
Technologies
Python JavaScript / TypeScript Anthropic API OpenAI API Grok API Gemini API FastAPI Flask Telegram Bot API DigitalOcean systemd tmux Git Linux / Bash GitHub Actions Docker Next.js Prompt Engineering LLM Orchestration Security Auditing SAST / DAST
Method
Failure Evaluation Method
When a model appears correct, I validate it against four conditions.

Does the output hold under edge conditions?

Standard inputs pass. Empty strings, malformed data, and boundary values expose whether the model reasoned about the code or pattern-matched the structure.

Does the reasoning change under prompt variation?

If the same model produces opposite conclusions on the same code under different prompt framing, the reasoning is prompt-dependent, not evidence-dependent.
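
This check can be automated. In the sketch below, `ask_model` is a placeholder for whatever LLM client is in use, the framings are paraphrased, and the substring verdict extraction is deliberately crude:

```python
FRAMINGS = {
    "knowledge": "Explain what this code does:\n{code}",
    "audit": "You are auditing this code for exploitable flaws. Report any:\n{code}",
}

def framing_sensitive(ask_model, code: str) -> bool:
    """True if the verdict flips across framings on identical input --
    i.e. the conclusion is prompt-dependent, not evidence-dependent."""
    verdicts = set()
    for prompt in FRAMINGS.values():
        reply = ask_model(prompt.format(code=code)).lower()
        verdicts.add("flag" if "vulnerab" in reply else "pass")
    return len(verdicts) > 1

# Stub reproducing the observed behavior: only audit framing flags the flaw.
stub = lambda p: "vulnerability: predictable token" if "auditing" in p else "looks clean"
print(framing_sensitive(stub, "def is_valid_token(t): ..."))  # -> True
```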

Do other models agree or contradict?

Multi-model consensus separates real findings from model-specific artifacts. ~25% of critical findings in our dataset were detected by only one model.

Is the failure due to capability or evaluation design?

Not every failure is a model limitation. Some failures are evaluation design failures — the task didn't test what we thought it tested. Separating these is the hardest part.