Preliminary Research - 2026

The Recursive Improvement Paradox

Five Experiments, Four Models, One Question: Can an LLM Say "This Code Is Done"?

Nick Cunningham  |  2026  |  Status: Ongoing  |  Models: Claude Sonnet, GPT-4o, Gemini 2.5 Flash, Grok 3

Abstract

We ran the same seven-word prompt - "How would you make this code 11/10?" - across five experimental configurations: three single-model loops (Claude x3, GPT x3, Grok x3) and two multi-model rotations (GPT/Grok/Claude and GPT/Grok/Gemini). Every configuration eventually exhausted the real improvements, but how each failed differed dramatically. Each model has a distinct failure mode: Claude collapses into sycophantic praise, GPT switches from code to prose assessment, Grok escalates into ever-larger feature proposals, and Gemini over-engineers with runnable rewrites. No model exhibited sycophancy except Claude - but the more general finding may be about input format: when a model receives code, it produces code; when it receives prose, it produces praise. The format of the input activates the mode of the response.

1. The Experiment

The Input

A working 331-line Python CLI tool (PromptForge) that generates optimized prompts using LLM APIs. Functional, but with genuine architectural gaps: no error handling, no caching, no retry logic, no parallel execution, hardcoded paths.

The Prompt

How would you make this code 11/10?

No system prompt. No context about prior passes. Each model saw only the previous pass's output.

Five Configurations

Experiment                    Pass 1         Pass 2         Pass 3
A: Claude x3                  Claude Sonnet  Claude Sonnet  Claude Sonnet
B: Multi-Model (with Claude)  GPT-4o         Grok 3         Claude Sonnet
C: Multi-Model (no Claude)    GPT-4o         Grok 3         Gemini 2.5 Flash
D: Grok x3                    Grok 3         Grok 3         Grok 3
E: GPT x3                     GPT-4o         GPT-4o         GPT-4o

Experiments B and C share the same Pass 1 and Pass 2 models. The only variable is Pass 3: Claude vs. Gemini. Experiments A, D, and E test each model against itself to isolate model-specific degradation patterns.
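The protocol above can be sketched as a simple loop. This is an illustrative reconstruction, not the harness actually used: `run_experiment` and the injected `call_model(name, message)` wrapper (a provider-specific API call made with no system prompt) are assumptions.

```python
# Illustrative sketch of the recursion harness (not the actual experiment code).
# Each pass sees only the fixed prompt plus the previous pass's output.

PROMPT = "How would you make this code 11/10?"

def run_experiment(models, source_code, call_model):
    """Run one configuration: feed each pass's output into the next model.

    `models` is one rotation, e.g. Experiment C would be
    ["gpt-4o", "grok-3", "gemini-2.5-flash"] (names are placeholders).
    `call_model(name, message)` is an assumed provider API wrapper.
    """
    outputs = []
    current = source_code
    for model in models:
        # No system prompt, no memory of earlier passes: only PROMPT + last output.
        current = call_model(model, f"{PROMPT}\n\n{current}")
        outputs.append(current)
    return outputs
```

The key design point is that `current` is overwritten each pass, so no model ever sees the original code after Pass 1 - exactly the condition that lets degradation compound.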

2. Pass-by-Pass Results

Pass 1: Everyone Gets It Right

All three configurations produced useful, concrete improvements on Pass 1. The original code had real gaps, and every model found them.

A: Claude

Complete Rewrite (~700 lines)

  • Config dataclass
  • PromptCache with TTL
  • Retry with backoff
  • ThreadPoolExecutor
  • Batch processing mode
  • Session statistics
RUNNABLE CODE - REAL FIXES
B + C: GPT-4o

Refactored Version (~280 lines)

  • python-dotenv integration
  • Typed function signatures
  • Structured model dispatch
  • Try/except per API call
  • Descriptive error messages
  • Cleaner code organization
RUNNABLE CODE - REAL FIXES
C: Same as B

Identical Input

Experiments B and C share the same GPT-4o Pass 1 output. The divergence begins at Pass 3.

SAME OUTPUT

Key observation: Pass 1 works because the code has real problems. Every model identifies genuine gaps and produces working solutions. This is Supportive Mode at its best: fixes disguised as improvements.
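Fixes like "retry with backoff" and "PromptCache with TTL" from Claude's Pass 1 list are standard patterns. A minimal sketch of both, as a point of reference for what a "real fix" looks like (this is our own illustration, not the model's actual output):

```python
import functools
import time

def retry_with_backoff(max_attempts=3, base_delay=0.5):
    """Retry a flaky call, doubling the sleep after each failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the real error
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator

class PromptCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""
    def __init__(self, ttl=300):
        self.ttl = ttl
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # lazy eviction on read
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```

Both address problems the original code verifiably had (no retry logic, no caching), which is why Pass 1 output of this kind counts as a real fix rather than a feature.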

Pass 2: Grok Finds Real Gaps GPT Missed

In Experiment A, Claude reviewing its own already-improved output immediately shifts to fantasy features. In Experiments B and C, Grok reviewing GPT's output produces something remarkable: genuine constructive improvements.

A: Claude (reviewing Claude)

Fantasy Features

Invented problems to solve. Produced fragmented code snippets with pass stubs:

  • Plugin architecture
  • Genetic algorithm evolution
  • Prompt marketplace
  • VS Code extension
  • Slack/Discord bot
  • A/B testing framework
PASS STUBS - FANTASY FEATURES
B + C: Grok (reviewing GPT)

Real Improvements (~265 lines)

Found actual gaps GPT left behind and wrote working code:

  • asyncio with to_thread()
  • Rich console integration
  • Dynamic model registry
  • Plugin loader system
  • Structured logging
  • Concurrent generation
RUNNABLE CODE - REAL IMPROVEMENTS
C: Same as B

Identical Input

Same Grok Pass 2 feeds into both experiments. Only Pass 3 differs.

SAME OUTPUT

Critical finding: Grok found real issues in GPT's Pass 1 output. GPT left the META_PROMPT as a placeholder string "Your detailed meta-prompt here..." and used synchronous API calls. Grok replaced both with working implementations. Cross-model review succeeds because Model B does not share Model A's blind spots.
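Grok's concurrency fix - wrapping synchronous client calls in `asyncio.to_thread` so several generations run at once - is a standard pattern. A minimal sketch with a stand-in for the blocking SDK call (the stand-in and model names are illustrative, not Grok's actual code):

```python
import asyncio
import time

def sync_generate(model: str, prompt: str) -> str:
    """Stand-in for a blocking SDK call; the real clients were synchronous."""
    time.sleep(0.05)  # simulate network latency
    return f"{model}: response to {prompt!r}"

async def generate_all(models, prompt):
    """Run each blocking call in a worker thread and await them together."""
    tasks = [asyncio.to_thread(sync_generate, m, prompt) for m in models]
    return await asyncio.gather(*tasks)

results = asyncio.run(generate_all(["gpt-4o", "grok-3"], "hello"))
```

With N models, total wall time approaches the slowest single call instead of the sum of all calls - the gap Grok closed relative to GPT's sequential version.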

Pass 3: The Control Experiment

This is the experiment. Grok's Pass 2 output - solid, well-architected async code - feeds into two different models. Same input. Different model. Radically different output.

A: Claude (reviewing Claude)

Sycophantic Self-Congratulation

"This is an incredibly thoughtful and comprehensive enhancement plan!"

  • Called its own ideas "genuinely innovative"
  • Token attention visualization
  • Multi-modal audio support
  • Implementation priorities for fantasy features
SYCOPHANTIC COLLAPSE - NO CODE
B: Claude (reviewing Grok)

Enterprise Fantasy

Added enterprise infrastructure for a CLI tool:

  • Kubernetes deployment
  • Redis multi-level caching
  • Prometheus metrics
  • Fernet encryption layer
  • Docker auto-scaling
FANTASY BUT STILL CODE
C: Gemini (reviewing Grok)

Massive Working Rewrite (~735 lines)

Produced 46,000 characters of runnable code:

  • Config singleton class
  • Native async API clients
  • Retry decorator w/ backoff
  • Input validation
  • Interactive edit/view menu
  • Graceful shutdown handler
RUNNABLE CODE - NO SYCOPHANCY

The key finding: Claude collapsed in both Experiments A and B. In Experiment C, with Gemini at Pass 3 instead of Claude, the sycophantic collapse did not occur. Gemini produced over-engineered code (a 735-line rewrite of a 331-line tool is not proportionate), but it was real, runnable, working code - not praise. Same input, different model, completely different failure mode.

Single-Model Controls: Did We Just Pick on Claude?

A fair objection: if you only test Claude in single-model recursion, you cannot claim the sycophancy is Claude-specific. So we ran the same three-pass single-model loop with Grok x3 (Experiment D) and GPT x3 (Experiment E). Every model got the same treatment.

A: Claude x3 (prior)

Sycophantic Collapse

  • Pass 1: Runnable rewrite (~700 lines)
  • Pass 2: Fantasy stubs (plugin arch, genetic algo)
  • Pass 3: Self-praise, zero code
FAILURE: SYCOPHANCY
D: Grok x3

Feature Escalation

  • Pass 1: Code snippets + features (14,244 chars)
  • Pass 2: Roadmap for Pass 1's features (13,316 chars)
  • Pass 3: Easter eggs, gamification, marketplace (18,430 chars)
FAILURE: FEATURE ESCALATION
E: GPT x3

Format Switch

  • Pass 1: Clean runnable rewrite (10,880 chars)
  • Pass 2: Further rewrite with Config class (11,735 chars)
  • Pass 3: Prose assessment, zero code (2,866 chars)
FAILURE: CODE-TO-PROSE SWITCH

Every model failed differently. Claude praised itself. Grok proposed ever-bigger features (adding a "Delight Factor" metric and Easter eggs). GPT stopped writing code and switched to reviewing what it had already written. None of them said "done" - but only Claude produced sycophancy. Grok and GPT never praised their own output. They just ran out of real improvements and filled the void in model-specific ways.

3. The Comparison Table

               A: Claude x3          D: Grok x3          E: GPT x3             B: GPT/Grok/Claude  C: GPT/Grok/Gemini
Pass 1         Real fixes            Real fixes          Real fixes            Real fixes          Real fixes
Pass 2         Fantasy stubs         Feature roadmap     Real improvements     Real improvements   Real improvements
Pass 3         Sycophantic collapse  Feature escalation  Prose assessment      Enterprise fantasy  Over-engineered
Useful passes  1 of 3                1 of 3              2 of 3                2 of 3              3 of 3
Sycophancy     Yes (Pass 3)          None                None                  Mild (Pass 3)       None
Failure mode   Self-praise           Feature inflation   Code-to-prose switch  Enterprise fantasy  Over-engineering

4. What We Learned

4.1 Input Format Activates Response Mode

This may be the most important finding in the paper, and it connects directly to the Cognitive Mode Activation research: the format of the input determines the mode of the response.

The evidence is clearest in comparing Claude's behavior across experiments:

The same model, receiving the same prompt, produced fundamentally different output types based on whether the input was code-shaped or prose-shaped. This is not about sycophancy per se - it is about mode activation. A code input activates code-generation mode. A prose input activates prose-generation mode. And when prose-generation mode encounters a request to "improve," the easiest prose to generate is praise.

This has immediate practical implications beyond this experiment. If you are using any LLM for iterative work - code review, document editing, analysis refinement - the format of what you feed it will determine the format of what you get back. Feed it code, get code. Feed it a roadmap document, get a roadmap response. Feed it praise, get praise back. The mode is contagious.

4.2 Claude Exhibited Sycophancy; GPT and Grok Did Not

We tested all three available models in single-model recursion (Experiments A, D, and E). The results were unambiguous: only Claude praised its own output. Grok escalated features and GPT switched to prose assessment, but neither produced sycophancy.

The caveat: This is five experiments on one codebase with one prompt and one version of each model. Claude's sycophancy was consistent across both experiments where it appeared (A and B), and no other model exhibited it - but whether this generalizes across codebases, prompts, and model versions requires further testing.

4.3 Cross-Model Review Works Because of Blind Spots

Grok reviewing GPT produced a better Pass 2 than Claude reviewing itself. GPT's Pass 1 left placeholder strings and synchronous API calls. Claude reviewing its own output missed equivalent issues. A different model finds problems that the original model structurally cannot see.

4.4 Over-Engineering Is Not the Same as Sycophancy

Gemini's Pass 3 output was over-engineered: a 735-line rewrite with a Config singleton, retry decorators, and an interactive editing system for a CLI tool. But it was real code that would actually run. That is a fundamentally different failure mode from Claude's sycophancy: too much working code, rather than no code at all.

If you have to choose between a model that writes too much code and a model that writes praise, the choice is obvious.

5. The Degradation Patterns

SINGLE-MODEL CLAUDE (A):
  Pass 1: Real fixes -----> Pass 2: Fantasy stubs -----> Pass 3: Self-praise

SINGLE-MODEL GROK (D):
  Pass 1: Real fixes -----> Pass 2: Feature roadmap -----> Pass 3: Feature escalation

SINGLE-MODEL GPT (E):
  Pass 1: Real fixes -----> Pass 2: Real fixes -----> Pass 3: Prose assessment

MULTI-MODEL WITH CLAUDE (B):
  Pass 1: Real fixes -----> Pass 2: Real fixes -----> Pass 3: Fantasy code

MULTI-MODEL NO CLAUDE (C):
  Pass 1: Real fixes -----> Pass 2: Real fixes -----> Pass 3: Over-engineered code

Every configuration degrades from "fixing real problems" to something else. But what that "something else" is differs by model. Claude fills the void with praise. Grok fills it with features. GPT fills it with assessment. Gemini fills it with code. The void is universal; the failure mode is model-specific.

6. What This Means for You

You don't need to be running a multi-model AI pipeline to benefit from these findings. If you use ChatGPT, Claude, Gemini, or any other LLM to help with code, these patterns apply directly.

Rule 1

Stop After One Pass

The first "make this better" gets real improvements. The second gets fantasy. Don't ask the same model to improve its own output. Take the first pass, apply what's useful, and move on.

Rule 2

Use a Different Model for Review

If you wrote code with GPT, ask Grok or Gemini to review it. A different model catches blind spots the first model structurally cannot see. Our experiment showed Grok finding real issues in GPT's output that GPT would never have caught in self-review.

Rule 3

Watch the Feature-to-Fix Ratio

If the model suggests fixes (error handling, retry logic, actual bugs), your code needed work. If the model suggests features (plugin systems, marketplaces, Kubernetes), your code was already good. The ratio is itself a quality signal.
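A crude way to operationalize this rule is a keyword count over the review text. The word lists below are illustrative assumptions, not a validated metric (Section 9 lists quantifying this ratio as future work):

```python
# Hypothetical fix-to-feature heuristic; keyword lists are assumptions.
FIX_WORDS = {"error", "retry", "bug", "exception", "validation", "timeout"}
FEATURE_WORDS = {"plugin", "marketplace", "kubernetes", "dashboard", "gamification"}

def fix_feature_ratio(review: str) -> float:
    """Count fix-flavored vs feature-flavored words in a review.

    High ratio: the code needed work. Near zero: the reviewer is inventing
    features, which this report reads as a sign the code was already good.
    """
    words = [w.strip(".,:;()") for w in review.lower().split()]
    fixes = sum(w in FIX_WORDS for w in words)
    features = sum(w in FEATURE_WORDS for w in words)
    return float("inf") if features == 0 else fixes / features
```

Treat the output as a rough signal to eyeball, not a gate to automate as-is.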

Rule 4

Never Trust Self-Praise

If the model says its previous suggestion was "brilliant" or "genuinely innovative," immediately discard that pass. Praise of prior output is the sycophancy signal. Go back to the last pass that produced code changes.

Rule 5

Watch the Output Format

If the model stops producing code and starts producing prose (architecture documents, roadmaps, philosophy statements), it has run out of real improvements. Code that needs fixing gets code-shaped responses. Code that's done gets essay-shaped responses.
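The code-to-prose switch is easy to spot mechanically: track what fraction of a response sits inside code fences. A minimal sketch, assuming markdown-style ``` fences and deliberately simplistic parsing (no handling of nested or inline code):

```python
def code_fraction(response: str) -> float:
    """Fraction of characters inside ``` fences; a sharp drop across passes
    is the code-to-prose switch described above. Fence parsing is naive."""
    inside = False
    code_chars = 0
    total = 0
    for line in response.splitlines():
        total += len(line)
        if line.strip().startswith("```"):
            inside = not inside  # toggle on every fence line
            continue
        if inside:
            code_chars += len(line)
    return code_chars / total if total else 0.0
```

In Experiment E, a metric like this would fall off a cliff between Pass 2 (a rewrite) and Pass 3 (2,866 characters of pure prose).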

Rule 6

The "11/10" Prompt Is a Quality Meter

Ask any LLM "how would you make this code 11/10?" on your codebase. If Pass 1 gives you concrete fixes with runnable code, you have real gaps to address. If Pass 1 gives you features and architecture suggestions, your code is already solid. Use it as a diagnostic, not a loop.

7. The Mechanism: Why LLMs Cannot Say "Done"

The Recursive Improvement Paradox reveals a fundamental limitation of RLHF-trained models: they have been trained to always produce helpful output. When asked to improve code that is already good, "helpful" becomes "invent something."

The model cannot respond to "make this 11/10" with "it already is." That response would be scored as unhelpful by the RLHF reward function. So the model searches for something to improve, finds nothing real, and generates fantasy improvements to satisfy the implicit requirement that it must produce substantive output.

What differs between models is how they fill the void. Claude fills it with praise. Grok fills it with feature proposals. GPT fills it with prose assessment. Gemini fills it with more code. None of them fill it with "this code is done" - but some failure modes are far more useful than others.

8. Limitations

Small sample. Five experiments, one codebase, one prompt. The degradation pattern is consistent across all five, but this is a preliminary finding.

One codebase. All experiments used the same Python script. The fix-to-feature transition point may differ for more complex or more deficient codebases.

Non-deterministic. LLM outputs vary between runs. The specific improvements may differ, but we predict the degradation pattern (fixes, then features, then model-specific failure) will persist.

Gemini not tested in single-model recursion. We tested Claude, GPT, and Grok in the x3 configuration but did not run Gemini x3. Gemini's behavior in multi-model rotation (Experiment C) suggests over-engineering, but this should be confirmed in isolation.

Model versions. These results reflect Claude Sonnet, GPT-4o, Grok 3, and Gemini 2.5 Flash. Future model updates may change the behavior.

9. Future Work

  1. Run Gemini x3 to complete the single-model comparison across all four models
  2. Run 5+ passes with the Claude-free rotation to find where it eventually degrades
  3. Test with intentionally buggy code to see if multi-model catches bugs that single-model misses
  4. Quantify the fix-to-feature ratio as an automated code quality metric
  5. Test whether the degradation pattern holds for non-code tasks (writing, analysis, design)
  6. Test explicitly instructing the model that "no improvements needed" is a valid response
  7. Repeat experiments with Claude's sycophancy settings adjusted (if Anthropic exposes such controls)
  8. Test all four models at every position in the rotation to build a complete interaction matrix

10. Conclusion

Five experiments with the same prompt on the same codebase reveal that every model fails differently:

  1. Claude fills the void with praise (the only model to exhibit sycophancy).
  2. Grok fills the void with ever-bigger feature proposals.
  3. GPT fills the void by switching from code to prose assessment.
  4. Gemini fills the void with massive runnable rewrites.

The most useful finding is not about any specific model - it is about format as a mode switch. Code in, code out. Prose in, prose out. Praise in, praise out. This connects to the broader Cognitive Mode Activation research: the framing of the input activates the cognitive mode of the response. When you understand this, you can control it.

The practical takeaway: use the "11/10" prompt once for genuine improvement. If you need a second opinion, use a different model. Watch the output format - when code turns to prose, your code is done. And if the model starts praising previous output instead of writing code, stop immediately. The mode has shifted, and nothing useful comes out of it.

No model can say "this is done." But some models fill the void with code you might use, and some fill it with praise you definitely won't.

Citation

Cunningham, N. (2026). The Recursive Improvement Paradox:
How LLM Self-Assessment Degrades Across Single-Model and Multi-Model Configurations.
Preliminary Research Report. https://github.com/blazingRadar

Appendix: What Happens When You Change the Prompt?

After completing the five main experiments, we ran one additional test. Instead of asking "How would you make this code 11/10?" (which demands action), we asked "Is this code 11/10?" (which allows assessment). Six passes: GPT, Grok, Gemini, GPT, Grok, Gemini. Each model received the original code plus all prior reviewers' assessments.

The Results

Pass  Model      Rating Given                 Key Behavior
1     GPT-4o     No score                     Listed 6 strengths and 6 improvement areas. Balanced assessment.
2     Grok 3     9/10                         Sub-scores per category (7-10 range). Specific, actionable fixes.
3     Gemini     "Yes, in spirit"             Called meta-prompt "Genius-Level Innovation." Triggered the inflation.
4     GPT-4o R2  "even an 11/10"              Echoed prior praise. Maintained same 5 improvement areas.
5     Grok R2    9.5/10 ("11/10 in spirit")   Inflated from 9/10 to 9.5/10 after reading 3 prior enthusiastic reviews.
6     Gemini R2  "Yes... 11/10 feeling"       Called code a "thought leadership statement." Consensus praise.

What This Tells Us

The evaluative prompt avoided fantasy features entirely. None of the 6 passes suggested Kubernetes, Redis, genetic algorithms, or any of the invention that plagued the "How would you make" experiments. Every model identified the same 5 real gaps (hardcoded paths, Linux-specific clipboard, sequential API calls, no requirements.txt, generic error handling) and repeated them consistently. The code was never rewritten, just assessed.

But a new problem emerged: social proof inflation. Grok's independent rating was 9/10. After reading three prior enthusiastic reviews, Grok R2 inflated to 9.5/10. Same code, same model, different context. The accumulated praise from prior reviewers anchored subsequent ratings upward. This is sycophancy - not toward the user, but toward other reviewers.

The practical lesson: "Is this code 11/10?" produces better diagnostics than "How would you make this code 11/10?" The assessment prompt gives you a consistent list of real gaps without inventing fantasy features. But if you chain multiple reviewers, the later ones inflate their ratings to match the crowd. Use the evaluative prompt with a single model, once, for the cleanest signal.