AI Code Review Benchmark 2026: Every Tool Tested, One Honest Comparison
6 benchmarks combined, one tool scores 36-51% depending on who tests it. 47% of developers use AI review but 96% don't trust it. The data nobody showed you.
Tired of slow code reviews? AI catches issues in seconds. You decide what gets published.
Five different organizations published AI code review benchmarks in the past six months. Each one crowned a different winner. Greptile measured an 82% catch rate — for themselves. When Augment tested Greptile on the same five repositories, the number dropped to 45%. CodeRabbit scored 44% in one benchmark and 51.2% in another. Qodo claimed 60.1% F1, the highest published score, on their own test suite.
None of these numbers are wrong. They just measured different things, scored differently, and — in every case — the organization running the test happened to win.
We collected every published benchmark, put the data side by side, and found the patterns hiding underneath the contradictions. This page is the result: one comparison table, a breakdown of why results diverge, and the methodology for an open benchmark anyone can reproduce.
Every AI code review benchmark in one table
Nobody has combined these results before. Here they are, sorted by the organization that ran each test:
Martian Benchmark (February 2026) — Independent
The closest thing to a neutral evaluation. Martian was founded by researchers from DeepMind, Anthropic, and Meta. They open-sourced their dataset, judge prompts, and evaluation pipeline.
| Tool | F1 Score | Method |
|---|---|---|
| Qodo (multi-agent) | 60.1% | Offline + online |
| CodeAnt AI | 51.7% | Offline only |
| CodeRabbit | 51.2% | Offline + online |
Scale: Offline test on 50 PRs + online monitoring of real GitHub activity (Jan–Feb 2026). The online component tracked whether developers actually fixed code after AI comments — a behavioral signal that is harder to game than synthetic testing.
Scoring: A comment counted as useful if the developer changed code in response. Comments that developers ignored or dismissed scored against the tool.
Greptile Benchmark (2025) — Vendor
Greptile tested five tools on 50 PRs across five popular open-source repositories: Sentry (Python), Cal.com (TypeScript), Grafana (Go), Keycloak (Java), and Discourse (Ruby).
| Tool | Catch Rate |
|---|---|
| Greptile | 82% |
| Cursor | 58% |
| Copilot | 54% |
| CodeRabbit | 44% |
| Graphite | 6% |
What "catch rate" means here: A bug counted as "caught" only when the tool explicitly identified the faulty code in a line-level comment and explained the impact. Default settings were used for all tools — no custom rules.
The problem: When Augment later tested the same five repositories, Greptile scored 45% — not 82%. Same repos, dramatically different results. The gap reveals how much the definition of "caught" and the specific bugs chosen can swing results.
Augment Code Benchmark (2026) — Vendor
Augment tested seven tools on the same five repositories as Greptile but expanded the ground truth dataset with manual verification.
| Tool | Precision | Recall | F1 |
|---|---|---|---|
| Augment Code Review | 65% | 55% | 59% |
| Cursor Bugbot | 60% | 41% | 49% |
| Greptile | 45% | 45% | 45% |
| Codex Code Review | 68% | 29% | 41% |
| CodeRabbit | 36% | 43% | 39% |
| Claude Code | 23% | 51% | 31% |
| GitHub Copilot | 20% | 34% | 25% |
Key insight: Precision and recall tell different stories. Claude Code had the second-highest recall (51%, it found a lot of bugs) but near-bottom precision (23%, most of its comments were false positives). Codex Code Review showed the inverse pattern: the highest precision (68%) with the lowest recall (29%). One tool is loud and thorough; the other is quiet, accurate, and misses most bugs.
Qodo Benchmark (2026) — Vendor
Qodo tested their multi-agent approach against eight tools across 100 PRs by injecting complex defects into real-world merged pull requests from active open-source repositories.
| Tool | F1 Score |
|---|---|
| Qodo (multi-agent) | 60.1% |
| Other tools | Not individually published |
Methodology: Qodo's injection approach inserts realistic bugs that simultaneously test code correctness and code quality. They designed defects to require understanding of the broader codebase, not just the diff — pushing tools toward deeper analysis.
CodeAnt AI Benchmark (2026) — Vendor
CodeAnt published results from a benchmark claiming 200,000 real pull requests — by far the largest dataset.
| Tool | F1 Score |
|---|---|
| CodeAnt AI | 51.7% |
| CodeRabbit | 51.2% |
| Others | Below 50% |
Scale claim: 200K PRs is dramatically larger than other benchmarks (50-100 PRs). However, the methodology details and raw data have not been published for independent verification.
Why every vendor wins their own benchmark
The pattern is consistent: Greptile tested Greptile and won. Augment tested Augment and won. Qodo tested Qodo and won. Three mechanisms explain this without assuming intentional manipulation:
1. The "caught" definition varies
Greptile counted a bug as caught when the tool identified the faulty code in a line-level comment. Augment counted precision and recall separately, penalizing noise. Martian used developer behavior — did someone actually fix the code? Three definitions, three different scores for the same tool on the same repository.
A tool that posts twenty comments per PR will score high on recall (it found the bug somewhere in those twenty comments) but low on precision (nineteen of those comments were noise). Depending on which metric you weight, the same tool looks excellent or mediocre.
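The tradeoff falls straight out of the metric definitions. A minimal sketch, using the twenty-comment tool above as input (the counts are illustrative):

```python
def review_metrics(true_positives: int, false_positives: int, missed_bugs: int):
    """Precision, recall, and F1 from raw comment counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + missed_bugs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The noisy tool above: 20 comments on one PR with one real bug.
# It caught the bug (1 true positive) and posted 19 noise comments.
p, r, f1 = review_metrics(true_positives=1, false_positives=19, missed_bugs=0)
print(f"precision={p:.0%} recall={r:.0%} f1={f1:.0%}")
# prints: precision=5% recall=100% f1=10%
```

Weight recall and the tool looks perfect; weight precision and it looks unusable. F1 splits the difference, which is why most benchmarks report all three.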
2. Bug selection bias
Each vendor selects or injects bugs that align with their tool's strengths. A tool optimized for security vulnerabilities will look brilliant on a bug set heavy with injection flaws — and mediocre on a set dominated by logic errors or race conditions. Nobody publishes a benchmark where they lose.
3. Small sample sizes amplify noise
Fifty PRs spread across five repos means ten PRs per repo. At that scale each PR is worth two percentage points, so a handful of ambiguous edge cases can swing a tool's score by double digits. Even so, sampling noise alone cannot explain Greptile's 82% versus Augment's 45% for the same tool: at 50 samples, the 95% confidence intervals for those two scores do not overlap. A gap that large points back to the bug selection and the definition of "caught."
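How wide those intervals are is easy to check with the standard normal-approximation formula for a proportion; a back-of-envelope sketch:

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# A 50%-ish score measured on 50 PRs vs 200 PRs:
for n in (50, 200):
    print(f"n={n}: 50% +/- {ci_half_width(0.5, n):.1%}")
```

At n = 50 the interval spans roughly ±14 points, which is why single-run scores on small benchmarks should be read as rough estimates, not rankings.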
The credibility spectrum
Not all benchmarks carry equal weight. Here is how to read them:
| Signal | More Credible | Less Credible |
|---|---|---|
| Who ran the test | Independent lab (Martian) | The vendor being tested |
| Raw data published | Yes (Martian, Augment) | No (some vendor claims) |
| Sample size | 100+ PRs (Qodo) | 50 PRs (Greptile, Augment) |
| Methodology | Open-source, reproducible | Described but not released |
| Scoring | Precision + recall + F1 | Single "catch rate" number |
| Multiple runs | Variance reported | Single run, no confidence interval |
Martian's benchmark currently sits at the top of this spectrum — independent researchers, open methodology, dual offline-online approach. But even their offline component used only 50 PRs.
What no benchmark has tested yet
Every published benchmark tests tools — CodeRabbit, Qodo, Greptile, Augment. None of them test the underlying models directly.
This matters because tools add layers on top of base models: custom prompts, retrieval-augmented generation, multi-agent workflows, post-processing filters. When CodeRabbit scores 51.2% F1, you don't know whether that reflects Claude's capability, CodeRabbit's prompt engineering, or their post-processing pipeline.
Questions that remain unanswered:
- Claude vs Gemini vs GPT on identical code review prompts — no head-to-head model comparison exists for the code review task specifically
- False positive rates by model — which model produces the most noise?
- Performance across 10+ languages — every benchmark uses the same five repos
- Cost-adjusted accuracy — which model gives you the most bugs per dollar?
- Hallucination rates — how often do models reference APIs, functions, or variables that don't exist?
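The last question lends itself to a cheap first-pass check that needs no model at all. A crude sketch, with hypothetical names throughout: flag any function call a review comment mentions that is defined nowhere in the repository.

```python
import re
from pathlib import Path

def called_names(comment: str) -> set[str]:
    """Identifiers the review comment refers to as calls, e.g. `parse_config()`."""
    return set(re.findall(r"\b([A-Za-z_]\w*)\s*\(", comment))

def defined_in_repo(repo_root: str, name: str) -> bool:
    """Crude grep for a Python definition of `name` anywhere in the repo."""
    pattern = re.compile(rf"\bdef\s+{re.escape(name)}\b")
    for path in Path(repo_root).rglob("*.py"):
        try:
            if pattern.search(path.read_text(errors="ignore")):
                return True
        except OSError:
            continue
    return False

def hallucinated_refs(repo_root: str, comment: str) -> set[str]:
    """Called names that appear nowhere in the repo: candidate hallucinations."""
    return {n for n in called_names(comment) if not defined_in_repo(repo_root, n)}
```

A real harness would resolve imports and builtins rather than grepping, but even this level of checking would surface the most blatant invented-API comments.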
What we're building: an open code review benchmark
We are constructing a benchmark designed to fill these gaps. The full methodology is published at BENCHMARK-METHODOLOGY.md and summarized here.
10 repositories, 10 languages
| Repo | Language | Stars | Domain |
|---|---|---|---|
| Sentry | Python | 40K+ | Error tracking |
| Cal.com | TypeScript | 33K+ | Scheduling |
| Grafana | Go | 65K+ | Observability |
| Keycloak | Java | 24K+ | Identity/Auth |
| Discourse | Ruby | 42K+ | Forum platform |
| Tokio | Rust | 28K+ | Async runtime |
| Folly | C++ | 28K+ | Performance library |
| Ktor | Kotlin | 13K+ | Web framework |
| Laravel | PHP | 80K+ | Web framework |
| Vapor | Swift | 24K+ | Server-side Swift |
We added Rust, C++, Kotlin, PHP, and Swift because current benchmarks only test Python, TypeScript, Go, Java, and Ruby. A model that excels on Python may struggle with Rust's ownership system or Swift's protocol-oriented patterns.
100 PRs with 150 injected bugs
Each repo contributes 10 PRs with 1-3 injected bugs per PR. Bug categories span five groups:
- Functional bugs (40%): off-by-one errors, null references, race conditions, resource leaks
- Security vulnerabilities (25%): mapped to CWE Top 25 — SQL injection, XSS, path traversal, SSRF, missing authorization
- Performance issues (15%): N+1 queries, unbounded collections, blocking calls in async contexts
- Code quality (15%): dead code, hardcoded secrets, missing input validation
- API misuse (5%): deprecated APIs, wrong argument ordering
Bug injection uses three methods: reversing real bug-fix commits (Greptile's approach), LLM-based synthetic injection with human validation (Qodo's approach), and reversing real CVE fixes (the approach DeepSource took with the OpenSSF CVE dataset).
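The first method can be sketched with plain git: revert the fix commit without committing, so the reverted diff becomes the ground-truth bug location. `inject_bug_from_fix` and its arguments are illustrative names, not part of any published harness.

```python
import subprocess

def inject_bug_from_fix(repo_dir: str, fix_sha: str, branch: str) -> None:
    """Re-introduce a historical bug by reverting its fix commit onto a test branch."""
    def git(*args: str) -> None:
        subprocess.run(["git", "-C", repo_dir, *args], check=True)

    git("checkout", "-b", branch)
    # Apply the inverse of the fix without committing, so the resulting
    # PR diff contains exactly the re-introduced bug and nothing else.
    git("revert", "--no-commit", fix_sha)
    git("commit", "-m", f"test: revert fix {fix_sha} (injected bug)")
```

The appeal of this method is that the bug is known to be real: a maintainer once shipped it and another maintainer once fixed it.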
5 models, head-to-head
| Model | Provider | Context Window |
|---|---|---|
| Claude Opus 4.6 | Anthropic | 200K |
| Claude Sonnet 4.6 | Anthropic | 200K |
| Gemini 2.5 Pro | Google | 1M |
| GPT-4.1 | OpenAI | 1M |
| GPT-o3 | OpenAI | 200K |
Every model receives the same system prompt, the same PR diff, and the same repository context. Temperature set to 0.1 for reproducibility. Each PR runs three times per model with majority vote scoring to reduce variance.
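The majority-vote step might look like the following sketch, which treats each run's findings as a set of (file, line) pairs and keeps only findings flagged in at least two of the three runs; the exact matching key is an assumption, not a published detail.

```python
from collections import Counter

def majority_vote(runs: list[set[tuple[str, int]]], threshold: int = 2) -> set[tuple[str, int]]:
    """Keep only findings flagged in at least `threshold` of the runs.

    Each run is the set of (file, line) findings from one model invocation.
    """
    counts = Counter(finding for run in runs for finding in run)
    return {finding for finding, n in counts.items() if n >= threshold}

# Three runs of the same model on the same PR (illustrative findings):
runs = [
    {("app.py", 42), ("db.py", 7)},
    {("app.py", 42)},
    {("app.py", 42), ("util.py", 3)},
]
print(majority_vote(runs))  # -> {('app.py', 42)}
```

One-off findings that appear in a single run are the ones most likely to be noise, which is what the threshold filters out.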
Scoring: LLM-as-judge with published rubric
Following Martian's approach, an LLM judge classifies each model comment against ground truth:
- Exact match (1.0): correct file, correct line range (±5 lines), correct bug category
- Partial match (0.5): correct file, general area (±20 lines), related category
- No match (0.0): wrong file or wrong location
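Applied mechanically, the rubric reduces to a small scoring function. This sketch simplifies "related category" to any category within the ±20-line window:

```python
def rubric_score(pred: tuple[str, int, str], truth: tuple[str, int, str]) -> float:
    """Score one model comment against one ground-truth bug.

    pred/truth are (file, line, category) tuples.
    """
    pred_file, pred_line, pred_cat = pred
    true_file, true_line, true_cat = truth
    if pred_file != true_file:
        return 0.0                      # no match: wrong file
    if abs(pred_line - true_line) <= 5 and pred_cat == true_cat:
        return 1.0                      # exact match: right line range and category
    if abs(pred_line - true_line) <= 20:
        return 0.5                      # partial match: right general area
    return 0.0                          # right file, wrong location

print(rubric_score(("auth.py", 118, "sqli"), ("auth.py", 120, "sqli")))  # -> 1.0
print(rubric_score(("auth.py", 135, "xss"),  ("auth.py", 120, "sqli")))  # -> 0.5
```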
We report precision, recall, F1, false positive rate, hallucination rate, cost per PR, and latency. All with 95% bootstrap confidence intervals.
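The percentile bootstrap behind those intervals is a few lines of stdlib Python; the per-PR scores below are illustrative:

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean per-PR score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

per_pr_f1 = [0.0, 0.5, 1.0, 0.5, 1.0] * 20   # illustrative scores for 100 PRs
lo, hi = bootstrap_ci(per_pr_f1)
print(f"mean={sum(per_pr_f1)/len(per_pr_f1):.2f}  95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting the interval rather than the point estimate is what makes claims like "model A beats model B" checkable.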
Everything open-source
The test suite, prompts, ground truth, raw API responses, and scoring code will be published under Apache 2.0. Anyone can reproduce our results, challenge our scoring, or add new models and repositories.
How to read benchmark numbers without getting misled
When you encounter AI code review benchmark claims — ours included — apply these filters:
Check who ran the test. If the vendor tested themselves, expect optimistic numbers. Look for independent evaluations or at least published raw data that others have verified.
Ask for precision AND recall. A single "accuracy" or "catch rate" number hides the precision-recall tradeoff. A tool with 80% recall and 20% precision catches most bugs but drowns you in false positives. A tool with 90% precision and 30% recall is quiet but misses most issues.
Look at the sample size. Fifty PRs is borderline. One hundred is adequate. Claims based on "thousands of PRs" without published methodology deserve skepticism — large numbers don't help if the scoring is opaque.
Check if results are reproducible. Can you run the same benchmark on your own code? If the test suite isn't published, the results are assertions, not evidence.
What practical accuracy means for your team
The raw F1 numbers — 45%, 51%, 60% — sound low. Here is what they mean in practice:
A tool with 55% F1 and 60% precision on a PR with 5 real issues will typically find 3 of them and add 2 false comments. A senior developer spends thirty seconds dismissing the false positives and saves fifteen minutes catching three bugs they might have missed.
The question isn't whether AI code review is perfect. The question is whether catching 3 out of 5 bugs automatically — even with some noise — is worth more than catching 0 out of 5 because nobody had time for a thorough review.
For teams that review 20+ PRs per week, even 50% recall with reasonable precision saves hours of review time. The math works at current accuracy levels. It works better when you run multiple models and compare results.
The same tool's score swings by up to 37 points depending on who tests it
We spent two days pulling every published benchmark into a single spreadsheet, and the result genuinely surprised us. The same tool, evaluated by different organizations on different bug sets, produces scores that barely overlap:
| Tool | Lowest Score | Highest Score | Range | Benchmarks |
|---|---|---|---|---|
| CodeRabbit | F1 36% (DeepSource) | F1 51.2% (Martian) | 15 points | 4 benchmarks |
| Greptile | F1 45% (Augment) | 82% catch (Greptile) | 37 points | 3 benchmarks |
| Cursor BugBot | F1 49% (Augment) | F1 80.5% (DeepSource) | 31 points | 2 benchmarks |
| Claude Code | F1 31% (Augment) | F1 62.4% (DeepSource) | 31 points | 2 benchmarks |
DeepSource tested on 165 real CVEs from the OpenSSF dataset — security vulnerabilities, not code quality issues. That single methodological choice flipped the leaderboard. Cursor BugBot went from middle-of-the-pack to first place. CodeRabbit dropped from respectable to last.
The takeaway is not that any benchmark is wrong. The takeaway is that a tool's score tells you how it performs on that specific test, not how it performs on your code.
What AI code review consistently misses
CodeRabbit did something unusual for a vendor — they published data that makes the entire category look bad. Their team analyzed 470 pull requests, split between 320 AI-co-authored and 150 human-only, and the pattern they found cuts against every marketing page in this space: AI catches the bugs that matter least and struggles with the bugs that cost you the most.
| Bug Category | AI vs Human | What This Means |
|---|---|---|
| Style and formatting | AI catches well | Lowest-impact issues — linters already handle these |
| Logic and correctness | 1.75x more errors in AI code | Misses domain-specific validation, edge cases |
| Concurrency bugs | ~2x more errors in AI code | Race conditions, deadlocks invisible in sequential tests |
| Security vulnerabilities (XSS) | 2.74x more in AI code | AI generates insecure XSS code 86% of the time |
| Architectural design flaws | 1.53x more in AI code | Privilege escalation paths, SOLID violations |
| Performance issues | 1.42x more in AI code | N+1 queries, connection pool leaks |
Source: CodeRabbit State of AI vs Human Code Generation Report, 470 PRs
A separate academic study tested GPT-4o and Gemini 2.0 Flash on code review specifically. GPT-4o achieved 68.5% correctness when given problem descriptions, but generated harmful suggestions — code changes that make things worse — 10.4% of the time. That number jumped to 23.8% when the model reviewed code without context about what to look for.
The practical risk is not just missed bugs. It is that AI sometimes introduces new ones through its suggestions.
47% of developers now use AI code review — but 96% don't trust it
Adoption doubled every year between 2023 and 2025, which is remarkable given that accuracy barely moved in the same period. The numbers tell a story about developer pragmatism over perfectionism:
| Source (Year) | AI Code Review Adoption |
|---|---|
| Stack Overflow (2023) | 11% |
| Stack Overflow (2024) | 22% |
| JetBrains DevEco (2025, 24,534 devs) | 44% |
| Stack Overflow (2025) | 47% |
| Jellyfish (Oct 2025) | 51.4% of teams |
The scale numbers are staggering when you line them up. GitHub reported 60 million Copilot code reviews since April 2025 — one in five pull requests on the entire platform now gets AI feedback before a human looks at it. Microsoft's internal engineering team runs AI review on 90% of PRs across 5,000 repositories, roughly 600,000 reviews per month. CodeRabbit just closed a $60 million Series B after reviewing 13 million PRs across 2 million repos. These are not experimental pilots — this is infrastructure.
The trust gap is where it gets interesting. SonarSource surveyed 1,100 developers and found that 42% of committed code is now AI-generated — projected to reach 65% by 2027. But 96% of developers do not fully trust that AI-generated code works correctly, and 38% say reviewing AI code takes more effort than reviewing human code.
LinearB's 2026 engineering benchmarks report put a number on what most reviewers already feel in their gut: AI-authored PRs carry 1.7x more issues than human PRs (10.83 vs 6.45 per PR) and get accepted at well under half the rate, 32.7% versus 84.4% for human code. Teams also wait 4.6x longer before picking up AI PRs, which suggests reviewers treat machine-generated changes with a level of suspicion that the accuracy data now justifies.
The paradox: more AI-generated code creates more need for AI-assisted review, because humans cannot keep up with the volume. But the AI doing the review misses the same categories of bugs that the AI writing the code introduces.
What a code review costs — AI vs human vs production
Nobody publishes this comparison, so we calculated it from current API pricing and industry salary data.
| Method | Cost per PR | Recall (est.) | Cost per bug found | Speed |
|---|---|---|---|---|
| Senior developer ($150/hr, 20 min) | $50.00 | ~90% | $37.04 | 20 min |
| Mid-level developer ($80/hr, 30 min) | $40.00 | ~70% | $38.10 | 30 min |
| Claude Opus 4.6 (API) | $0.045 | ~55% | $0.055 | ~30 sec |
| Claude Sonnet 4.6 (API) | $0.027 | ~55% | $0.033 | ~15 sec |
| GPT-4.1 (API) | $0.016 | ~55% | $0.020 | ~10 sec |
| Gemini 2.5 Pro (API) | $0.014 | ~55% | $0.017 | ~10 sec |
| Three models combined (Opus + GPT-4.1 + Gemini) | $0.075 | ~65% | $0.077 | ~30 sec |
Based on: median PR = 5,000 input tokens + 800 output tokens (arxiv research reports 3,937 avg tokens/PR). API pricing from official pages as of April 2026.
Running three models on every PR (Opus, GPT-4.1, and Gemini 2.5 Pro) costs seven and a half cents. A hundred PRs per month costs $7.50 in API fees. The same volume of human review at mid-level rates costs $4,000.
The cost argument for AI review has never been about replacing human reviewers. It is about running a $0.075 first pass that catches 55-65% of issues before a human spends twenty minutes on the ones that remain. The human reviewer focuses on architecture, business logic, and concurrency — the exact categories where AI performs worst.
A bug caught in code review costs roughly $37 to fix (30 minutes of developer time). The same bug found in production costs $1,125 on average — a 30x multiplier documented across multiple engineering organizations. One prevented production bug per month produces a 15,000% return on $7.50 in monthly API costs.
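The arithmetic behind those claims, using only the figures quoted in this section:

```python
# Figures quoted above.
api_cost_per_pr = 0.075            # three models on one PR
prs_per_month = 100
review_fix_cost = 37.0             # bug fixed during code review
production_fix_cost = 1125.0       # same bug found in production

monthly_api_cost = api_cost_per_pr * prs_per_month
multiplier = production_fix_cost / review_fix_cost
roi = production_fix_cost / monthly_api_cost   # return on one prevented bug

print(f"monthly API cost: ${monthly_api_cost:.2f}")   # -> $7.50
print(f"production multiplier: {multiplier:.0f}x")    # -> 30x
print(f"return: {roi:.0%}")                           # -> 15000%
```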
Frequently Asked Questions
Which AI code review tool has the highest accuracy in 2026?
Depends entirely on who ran the test. Qodo measured their own F1 at 60.1% across 100 PRs. Greptile measured themselves at 82% catch rate on 50 PRs. When Augment re-tested Greptile on the same repos, the score dropped to 45%. Every vendor wins their own benchmark — which is exactly the problem independent testing needs to solve.
What is a good F1 score for an AI code review tool?
Current tools hover around 50-60% F1 on independent benchmarks. Martian's evaluation put the best tools at 51-60% F1. Anything above 50% with false positive rate below 40% is competitive in April 2026. These numbers will climb as models improve throughout the year.
How do you benchmark AI code review tools fairly?
Three things matter: independent evaluation (not self-testing), standardized bug injection across identical PRs, and published methodology anyone can reproduce. Martian pioneered the dual offline-online approach. Our benchmark uses 100 PRs across 10 repos with LLM-as-judge scoring and all data on GitHub.
Does CodeRabbit catch 44% or 51% of bugs?
Both numbers come from legitimate tests. Greptile measured CodeRabbit at 44% catch rate on 50 PRs. Martian's independent benchmark measured 51.2% F1. The gap comes from different bug sets, different scoring definitions, and different repository selections.
Is Claude or GPT better for code review?
No head-to-head model comparison exists for code review specifically. SWE-bench tests code generation, not review. All existing benchmarks test tools rather than underlying models. We are building the first benchmark that tests Claude, Gemini, and GPT directly on the same 100 PRs with identical prompts.
What is the false positive rate for AI code review?
Augment's benchmark measured precision alongside recall. CodeRabbit showed 36% precision (64% false positive rate). Augment achieved 65% precision (35% false positives). Claude Code scored 23% precision (77% false positives). Below 40% false positives is the practical threshold where developers stop ignoring the tool.
How many PRs do you need for a reliable benchmark?
Greptile and Augment used 50 PRs — borderline for statistical significance. Qodo used 100. Our benchmark targets 100 PRs across 10 repos with 3 runs per model and bootstrap confidence intervals to ensure differences between models are statistically meaningful.
Can I run the benchmark myself?
The test suite will be open-source under Apache 2.0 on GitHub. It includes PR diffs, ground truth, scoring rubrics, and wrappers for Claude, Gemini, and GPT APIs. Reproduce our results or add your own models.
Sources and benchmark data
All benchmark data in this article comes from primary published sources:
Benchmarks:
- Martian Code Review Benchmark — withmartian.com/post/code-review-bench-v0 and GitHub
- Greptile AI Code Review Benchmarks — greptile.com/benchmarks
- Augment Code Review Benchmark — augmentcode.com/blog
- Qodo AI Code Review Benchmark — qodo.ai/blog
- CodeAnt AI Benchmark — codeant.ai/blogs
- DeepSource OpenSSF CVE Benchmark — deepsource.com/blog/ai-code-review-benchmarks
Adoption and industry data:
- GitHub Copilot: 60M Code Reviews — github.blog
- CodeRabbit Series B ($60M, 13M PRs) — coderabbit.ai/blog
- CodeRabbit AI vs Human Code Report — coderabbit.ai/blog
- SonarSource State of Code 2026 — sonarsource.com/blog
- LinearB 2026 Engineering Benchmarks — linearb.io/resources
- JetBrains Developer Ecosystem 2025 — devecosystem-2025.jetbrains.com
- Stack Overflow Developer Survey 2025 — survey.stackoverflow.co/2025/ai
- Microsoft AI Code Reviews at Scale — devblogs.microsoft.com
Academic research:
- Evaluating LLMs for Code Review (arXiv 2505.20206) — GPT-4o 68.5% accuracy, 10.4% harmful suggestion rate
- Automated Code Review In Practice (arXiv 2412.18531) — avg 3,937 tokens per PR review
Try it on your next PR
AI reviews your code for bugs, security issues, and logic errors. You approve what gets published.
Free: 10 AI reviews/day, 1 repo. No credit card.
Related Articles
AI PR Review in 2026: What Actually Works (And What Wastes Your Team's Time)
AI PR review tools compared: CodeRabbit, Copilot, Bugbot, Git AutoReview. Real stats from Microsoft (5,000 repos), Qodo (609 devs), and setup guides for GitHub, GitLab, Bitbucket.
Pull Request Template: Complete Guide for GitHub, GitLab & Bitbucket (2026)
Copy-paste PR templates for GitHub, GitLab, Bitbucket & Azure DevOps. Real examples from React, Angular, Next.js & Kubernetes. Setup, enforcement, and AI review integration.
GitHub Copilot Code Review 2026: 60M Reviews In — Is It Worth $10/Month?
GitHub Copilot hit 60 million code reviews. We break down how it works, what it catches, what it misses, real pricing math for teams, and when alternatives like Git AutoReview make more sense.
Get the AI Code Review Checklist
25 traps that slip through PR review — with code examples. Plus weekly code review tips.
Unsubscribe anytime. We respect your inbox.