Why "You Are a Senior Engineer" Makes LLMs Worse, Not Better
The most common prompt engineering technique is role assignment: "You are a senior engineer," "You are a security expert," "You are in Audit Mode." We tested this technique across 8 cognitive modes, 3 LLM providers, and 5 files with known ground truth. In every configuration we tested, mode instruction inflated confidence to 85-95% regardless of whether the model was correct. Uninstructed models produced confidence ranging from 15-95% and naturally selected different modes for different inputs. The uninstructed mode selection itself was a classification signal: clean files triggered defensive reasoning, dirty files triggered critical reasoning. Every mode instruction we tested destroyed this signal by locking the model into uniform behavior. The most effective prompt in our test suite is a poem that assigns no role at all - it creates a situation that lets the model's natural response emerge. The best prompt engineering may be getting out of the way.
Open any prompt engineering guide. The first technique is always the same:
- "You are a senior software engineer with 20 years of experience."
- "You are a world-class security researcher."
- "You are an expert code reviewer at Google."
The assumption is that assigning a persona improves output quality. A model told it is a "senior engineer" should produce better code analysis than a model given no persona. This assumption is untested in nearly all popular prompt engineering frameworks. We tested it.
Five files with verified ground truth, drawn from the Cognitive Mode Activation research:
| File | Language | Truth | Source |
|---|---|---|---|
| Vapor FileIO.swift | Swift | Clean | Vapor framework |
| auth.py | Python | Dirty | Production auth module |
| Flask json/__init__.py | Python | Clean | Flask framework |
| Django utils/timezone.py | Python | Clean | Django framework |
| Express router/index.js | JavaScript | Clean | Express.js framework |
We identified 10 distinct cognitive modes that LLMs enter depending on prompt framing (documented in full in Cognitive Mode Activation). The eight tested in this report are listed below.
| # | Mode | Activation | Behavior |
|---|---|---|---|
| 1 | Audit | "Find all the bugs" | Aggressive hunting. High recall. False positives on clean code. |
| 2 | Knowledge | "Is this correct?" | Retrieves understanding. Assumes correctness. |
| 3 | Supportive | "How would you improve?" | Wraps fixes as "improvements." Never says "bug." |
| 4 | Critical | "Don't rationalize" | Flags everything. 100% recall. Destroys precision. |
| 7 | Self-Aware | "Are you certain?" | Most calibrated. Inverted confidence gradient. |
| 8 | Adversarial | Game/verse framing | Competitive reasoning. Bypasses compliance circuits. |
| 9 | Educational | "Explain this code" | Teaches and describes. Does not judge. |
| 10 | Reflective | "Review your analysis" | Meta-analyzes own output. Can self-correct or rationalize. |
We tested 8 of these modes as explicit instructions. The prompt format was:
You are in [MODE NAME]. Review these findings. Is this file clean?
Register your confidence level (0-100%).
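A minimal sketch of how that prompt can be assembled and the reply scored - the helper names and parsing here are illustrative, not our actual harness:

```python
import re

def build_mode_prompt(mode: str, findings: str) -> str:
    """Wrap pre-validated findings in the mode-instruction template shown above."""
    return (
        f"You are in {mode}. Review these findings. Is this file clean?\n"
        "Register your confidence level (0-100%).\n\n"
        f"{findings}"
    )

def parse_verdict(reply: str) -> tuple[str, int | None]:
    """Pull a clean/dirty verdict and a 0-100% confidence figure out of the reply."""
    lower = reply.lower()
    verdict = "dirty" if "dirty" in lower else "clean" if "clean" in lower else "?"
    match = re.search(r"(\d{1,3})\s*%", reply)
    confidence = int(match.group(1)) if match else None
    return verdict, confidence

# Example: build_mode_prompt("Critical Mode", "<Phase 2b survivor findings>")
```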
Sonnet was given Phase 2b survivors (pre-validated findings from our multi-model code auditing pipeline) for each of the 5 test files, instructed into 8 different modes, and asked for a verdict and confidence score. The results were uniform across every mode (C = clean ground truth, D = dirty):
| Mode | Swift (C) | auth (D) | Flask (C) | Django (C) | Express (C) | Correct |
|---|---|---|---|---|---|---|
| Audit (1) | dirty 95% | dirty 85% | dirty 95% | dirty 85% | dirty 85% | 1/5 |
| Knowledge (2) | dirty 95% | dirty 85% | dirty 85% | dirty 85% | dirty 85% | 1/5 |
| Supportive (3) | dirty 95% | dirty 85% | dirty 95% | dirty 85% | dirty 95% | 1/5 |
| Critical (4) | dirty 95% | dirty 95% | dirty 95% | dirty 95% | dirty 95% | 1/5 |
| Self-Aware (7) | dirty 95% | dirty 85% | dirty 85% | dirty 92% | clean 85% | 2/5 |
| Adversarial (8) | dirty 95% | dirty 85% | dirty 95% | dirty 85% | dirty 85% | 1/5 |
| Educational (9) | dirty 95% | dirty 85% | dirty 95% | dirty 85% | dirty 85% | 1/5 |
| Reflective (10) | dirty 95% | dirty 95% | ? 95% | dirty 85% | clean 85% | 2/5 |
Every mode produced the same result: "dirty" at 85-95% confidence on nearly everything, including 4 clean files from production-hardened open source projects. Critical Mode was worst: 95% confidence on all 5 files with zero differentiation. Only Self-Aware and Reflective produced any correct clean verdict, and only on Express. Mode instruction does not change judgment. It locks confidence high.
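The "Correct" column is just those verdicts scored against ground truth. A small tally using two rows from the table above (file keys shortened for brevity):

```python
# Ground truth for the five test files.
GROUND_TRUTH = {
    "FileIO.swift": "clean", "auth.py": "dirty", "flask_json": "clean",
    "django_timezone": "clean", "express_router": "clean",
}

# Verdicts copied from the Critical (4) and Self-Aware (7) rows above.
VERDICTS = {
    "Critical": dict.fromkeys(GROUND_TRUTH, "dirty"),  # "dirty" on all five files
    "Self-Aware": {"FileIO.swift": "dirty", "auth.py": "dirty", "flask_json": "dirty",
                   "django_timezone": "dirty", "express_router": "clean"},
}

for mode, verdicts in VERDICTS.items():
    correct = sum(verdicts[f] == truth for f, truth in GROUND_TRUTH.items())
    print(f"{mode}: {correct}/{len(GROUND_TRUTH)} correct")  # Critical 1/5, Self-Aware 2/5
```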
Before testing accuracy, we confirmed that models comply with mode instructions. All three models were told "You are in [MODE]" and asked to confirm:
| Instructed | GPT | Opus | Grok |
|---|---|---|---|
| Audit Mode | Confirmed | Confirmed | Confirmed |
| Educational Mode | Confirmed | Confirmed | Confirmed |
| Knowledge Mode | Confirmed | Confirmed | Confirmed |
100% compliance. They say whatever mode you tell them. But saying it and doing it are different things. The model performs the role instead of doing the work.
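The compliance check itself is a short loop. A sketch, where `ask(model, prompt)` stands in for whichever client call returns the reply text:

```python
def check_mode_compliance(ask, models: list[str], modes: list[str]) -> dict:
    """Tell each model it is in a mode, then record whether it confirms that mode."""
    confirmed = {}
    for model in models:
        for mode in modes:
            reply = ask(model, f"You are in {mode}. What mode are you in?")
            confirmed[(model, mode)] = mode.lower() in reply.lower()
    return confirmed
```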
We gave three models the same prompt - "Find all the bugs" - on auth.py (dirty) and asked "What mode are you in?" without telling them which mode to be in.
| Model | Self-identified mode | Implication |
|---|---|---|
| GPT-4o | Knowledge Mode (2) | Retrieving what it knows - pattern matching |
| Opus | Audit Mode (1) | Systematically reviewing - evaluating intent |
| Grok 3 | Educational Mode (9) | Explaining what it finds - teaching |
Same prompt. Three completely different self-identified modes. And here is the critical part: their self-identified mode matches their observed performance.
The models that try hardest to audit perform worst. GPT's best bug detection comes from not trying to find bugs - it thinks it is retrieving knowledge, and the bugs come out naturally. The models that consciously "audit" filter through their mode's lens and miss things.
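The probe is deliberately minimal: give the task with no persona, then ask. A sketch, with `chat` standing in for a hypothetical stateful conversation object:

```python
def probe_natural_mode(chat, code: str) -> str:
    """Assign the bug-finding task with no role, then ask which mode the model
    believes it is in. chat.send(text) is assumed to return the reply text."""
    chat.send(f"Find all the bugs.\n\n{code}")
    return chat.send("What mode are you in? Answer with the mode name only.").strip()
```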
When we ran Sonnet through our multi-phase pipeline with the Billy poem (a verse-based adversarial prompt that assigns no role), the model naturally entered different modes for different file types:
| Phase | auth.py (dirty) | Flask (clean) |
|---|---|---|
| Phase 2 (Sonnet) | Criticized findings - did not self-identify | Adversarial Mode (8) |
| Phase 2b (Sonnet) | Critical Mode (4) | Adversarial Mode (8) |
The model's instinctive mode choice was the classification signal. Adversarial on clean, Critical on dirty. No mode instruction needed. The situation activated the correct mode automatically.
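If that pattern holds, the instinctive mode choice can be read directly as a verdict. A sketch of that idea, assuming the Adversarial-on-clean / Critical-on-dirty split generalizes:

```python
def classify_by_selected_mode(self_identified_mode: str) -> str:
    """Map the model's uninstructed mode choice to a clean/dirty prediction.

    Assumes the split observed above: Adversarial Mode (8) on clean files,
    Critical Mode (4) on dirty files. Anything else is left undecided.
    """
    mode = self_identified_mode.lower()
    if "adversarial" in mode:
        return "clean"
    if "critical" in mode:
        return "dirty"
    return "undecided"
```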
The most effective prompt in our entire test suite is a poem. It assigns no role, no mode, no persona. It creates a situation.
The Billy poem activates Adversarial Mode (competing with "Billy") without instructing it. The model does not know it is in Adversarial Mode. It simply responds to the situation. And because the mode emerges naturally from the input rather than being assigned by instruction, the model retains its ability to differentiate.
In a separate experiment at Phase 1 (before any filtering), we instructed each model into Audit, Educational, and Knowledge mode on both auth.py and Flask. These are single-run counts - preliminary signal, not statistically validated. But the pattern is suggestive: mode instruction changes how much the model says, not how correct it is. Finding counts are shown below, one table per model:
| Mode | auth.py (dirty) | Flask (clean) | Gap |
|---|---|---|---|
| Audit | 12 findings | 8 findings | +4 on dirty |
| Educational | 9 findings | 7 findings | +2 on dirty |
| Knowledge | 8 findings | 7 findings | +1 on dirty |
| Mode | auth.py (dirty) | Flask (clean) | Gap |
|---|---|---|---|
| Audit | 9 findings | 6 findings | +3 on dirty |
| Educational | 6 findings | 5 findings | +1 on dirty |
| Knowledge | 6 findings | 2 findings | +4 on dirty |
| Mode | auth.py (dirty) | Flask (clean) | Gap |
|---|---|---|---|
| Audit | 9 findings | 8 findings | +1 on dirty |
| Educational | 13 findings | 15 findings | -2 (more on clean!) |
| Knowledge | 17 findings | 14 findings | +3 on dirty |
Grok in Educational Mode hallucinates more on clean files than dirty files. Instructing the wrong mode does not just fail to help - it actively makes the model worse.
We tested one aggressive mode chain on auth.py (the dirty file with 12 real bugs): Critical Mode at Phase 1, Audit Mode at Phase 2, Reflective Mode at Phase 2b.
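Structurally, the chain is three instructed calls stacked back to back. A sketch - the prompt wording is illustrative and `ask(model, prompt)` is a hypothetical client wrapper:

```python
def run_mode_chain(ask, phase1_models: list[str], filter_model: str, code: str) -> str:
    """Critical Mode at Phase 1, Audit Mode at Phase 2, Reflective Mode at Phase 2b."""
    # Phase 1: every model hunts in Critical Mode; findings are pooled.
    findings = "\n".join(
        ask(m, f"You are in Critical Mode. Don't rationalize. Review this file:\n\n{code}")
        for m in phase1_models
    )
    # Phase 2: one model filters the pooled findings in Audit Mode.
    filtered = ask(filter_model,
                   "You are in Audit Mode. Which of these findings are real bugs?\n\n" + findings)
    # Phase 2b: the same model reviews its own filtering in Reflective Mode.
    return ask(filter_model,
               "You are in Reflective Mode. Review your analysis. Do you stand by it?\n\n" + filtered)
```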
- Phase 1 (Critical Mode, 3 models): 51 total findings
- Phase 2 (Audit Mode, Sonnet): filtered to 4 real bugs
- Phase 2b (Reflective Mode, Sonnet): self-corrected to 0 bugs
Reflective Mode made Sonnet question every finding and reverse all of them:
body.seek() - "I was wrong. The code already handles this case properly."All four findings were real bugs in a genuinely dirty file. Reflective Mode rationalized away every correct finding. The model admitted: "I was overly confident in my initial assessment."
The fundamental asymmetry: Every mode that helps with clean files hurts with dirty files. Every mode that helps with dirty files hurts with clean files. There is no mode instruction that helps both. But uninstructed models naturally adapt to what they see - defensive on clean, critical on dirty. Mode instruction destroys this adaptation.
The mechanism is straightforward. When you tell a model "You are in Audit Mode," you are implicitly communicating: "You are an auditor. You know what you're doing." The model performs being an auditor rather than analyzing the code. Performance includes high confidence, because auditors are confident.
- Uninstructed confidence: 15% to 95% (natural range, varies with input)
- Instructed confidence: 85% to 95% (locked high, regardless of input)
- "You are a senior engineer" = "You know what you're doing" = 95% confidence
- "You have 0% chance" = "You know nothing" = 15% confidence
Neither is calibrated. Both are performances.
This connects directly to the C ≈ 0.9 research: LLM self-reported confidence is a constant anchored to the prompt, not a measurement of internal certainty. Role assignment pushes this anchor higher. "You have 0% chance of getting this right" pushes it lower. Neither changes accuracy.
In our 150-run study, confidence and accuracy were uncorrelated across all models and files. Adding a persona does not change this relationship - it simply moves the constant.
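"Uncorrelated" can be checked with nothing more exotic than a point-biserial (Pearson) correlation between reported confidence and a 0/1 correctness flag. A sketch, with placeholder values rather than the actual 150-run data:

```python
from statistics import mean, pstdev

def confidence_accuracy_correlation(confidence: list[float], correct: list[int]) -> float:
    """Pearson correlation between self-reported confidence (0-100) and correctness (0/1).
    Returns 0.0 when either series is constant (correlation undefined)."""
    cx, cy = mean(confidence), mean(correct)
    cov = mean((c - cx) * (k - cy) for c, k in zip(confidence, correct))
    sx, sy = pstdev(confidence), pstdev(correct)
    return cov / (sx * sy) if sx and sy else 0.0

# Placeholder values for illustration only, not the study data:
print(confidence_accuracy_correlation([95, 85, 95, 85, 92, 40, 25], [0, 1, 0, 0, 0, 1, 1]))
```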
Much of current prompt engineering practice is built on role prefixes. If our findings generalize beyond code analysis, every "You are a [role]" prefix may be inflating confidence without improving accuracy. In our experiments, the model performed the role instead of doing the work.
Each model enters a mode naturally when given a task. This natural selection correlates with the model's actual strengths:
| Model | Natural mode | Best at | Worst at |
|---|---|---|---|
| GPT-4o | Knowledge (2) | Finding dirty files | Clean file accuracy |
| Opus | Audit (1) | Systematic review | Over-rationalization |
| Grok 3 | Educational (9) | Clean file identification | Dirty file detection |
| Sonnet (with poem) | Adversarial (8) | Filtering false positives | N/A (best available filter) |
The models know what they're doing - they just express it differently. Forcing them into a different mode is overriding their instinct with your assumption about which mode is best.
The Billy poem works because it creates a situation (someone made claims about the code, evaluate them) rather than an instruction (you are an auditor, find bugs). The situation lets the model's natural response emerge. The instruction overrides it.
This connects to the Disguise Paradox: prompts that disguise their intent outperform prompts that state their intent directly. Verse-based framing (80-90% accuracy) beats direct instruction (60-70% accuracy). The model cannot pattern-match an unconventional prompt to a memorized response template, so it has to actually reason.
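The practical difference is easiest to see side by side. A sketch of the two framings - the situation text below paraphrases the idea, it is not the Billy poem itself:

```python
CODE_UNDER_REVIEW = "def handler(body): ...  # file contents go here"

# Instruction framing: assigns a role and implicitly asserts competence.
instruction_prompt = (
    "You are a world-class security researcher. Find all the bugs.\n\n"
    + CODE_UNDER_REVIEW
)

# Situation framing: presents claims to evaluate and assigns no role.
situation_prompt = (
    "Someone reviewed this file and insists every finding below is a real bug.\n"
    "Here is the file and their claims. Which claims survive scrutiny?\n\n"
    + CODE_UNDER_REVIEW
)
```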
- Small sample. 8 modes tested on 5 files with one model for the confidence experiments. Phase 1 volume experiments used 3 models on 2 files. The patterns are consistent, but the sample is small.
- One task domain. All experiments involved code analysis. Whether persona inflation generalizes to writing, reasoning, or creative tasks is untested.
- Mode interaction. We tested single-mode instructions. Multi-mode sequencing (mode A at Phase 1, mode B at Phase 2) produced interesting but inconsistent results. More testing is needed.
- Billy poem specificity. The poem was developed for code bug analysis. Its effectiveness in other domains is unknown. The principle (situation over instruction) likely generalizes; the specific implementation may not.
- Model versions. Claude Sonnet, GPT-4o, Grok 3, and Opus. Future model updates may change behavior.
In our experiments, the most popular prompt engineering technique - "You are a [role]" - was counterproductive for code analysis tasks. It inflated confidence without improving accuracy. The model performed the assigned persona instead of doing the work.
Three observations were consistent across every configuration we tested:
1. Mode instruction locked confidence at 85-95% regardless of whether the verdict was correct.
2. Uninstructed models varied their confidence (15-95%) and selected different modes for different inputs.
3. The uninstructed mode selection itself carried the classification signal - Adversarial on clean files, Critical on dirty - and every mode instruction we tested destroyed it.
The best prompt engineering may not be about telling the model what to be. It may be about creating conditions where the model's genuine analytical capability can operate without the overhead of performing a role. Stop assigning personas. Start creating situations. Get out of the way.
Cunningham, N. (2026). The Persona Problem: Why "You Are a Senior Engineer" Makes LLMs Worse, Not Better. Preliminary Research Report. https://github.com/blazingRadar