AI Code Review Benchmark 2026: Every Tool Tested, One Honest Comparison
6 benchmarks combined, one tool scores 36-51% depending on who tests it. 47% of developers use AI review but 96% don't trust it. The data nobody showed you.
Tired of slow code reviews? AI catches issues in seconds. You decide what gets published.
Five different organizations published AI code review benchmarks in the past six months. Each one crowned a different winner. Greptile measured an 82% catch rate — for themselves. When Augment tested Greptile on the same five repositories, the number dropped to 45%. CodeRabbit scored 44% in one benchmark and 51.2% in another. Qodo claimed 60.1% F1, the highest published score, on their own test suite.
None of these numbers are wrong. They just measured different things, scored differently, and — in every case — the organization running the test happened to win.
We collected every published benchmark, put the data side by side, and found the patterns hiding underneath the contradictions. This page is the result: one comparison table, a breakdown of why results diverge, and the methodology for an open benchmark anyone can reproduce.
Every AI code review benchmark in one table
Nobody has combined these results before. Here they are, sorted by the organization that ran each test:
Martian Benchmark (February 2026) — Independent
The closest thing to a neutral evaluation. Martian was founded by researchers from DeepMind, Anthropic, and Meta. They open-sourced their dataset, judge prompts, and evaluation pipeline.
| Tool | F1 Score | Method |
|---|---|---|
| Qodo (multi-agent) | 60.1% | Offline + online |
| CodeAnt AI | 51.7% | Offline only |
| CodeRabbit | 51.2% | Offline + online |
Scale: Offline test on 50 PRs + online monitoring of real GitHub activity (Jan–Feb 2026). The online component tracked whether developers actually fixed code after AI comments — a behavioral signal that is harder to game than synthetic testing.
Scoring: A comment counted as useful if the developer changed code in response. Comments that developers ignored or dismissed scored against the tool.
Greptile Benchmark (2025) — Vendor
Greptile tested five tools on 50 PRs across five popular open-source repositories: Sentry (Python), Cal.com (TypeScript), Grafana (Go), Keycloak (Java), and Discourse (Ruby).
| Tool | Catch Rate |
|---|---|
| Greptile | 82% |
| Cursor | 58% |
| Copilot | 54% |
| CodeRabbit | 44% |
| Graphite | 6% |
What "catch rate" means here: A bug counted as "caught" only when the tool explicitly identified the faulty code in a line-level comment and explained the impact. Default settings were used for all tools — no custom rules.
The problem: When Augment later tested the same five repositories, Greptile scored 45% — not 82%. Same repos, dramatically different results. The gap reveals how much the definition of "caught" and the specific bugs chosen can swing results.
Augment Code Benchmark (2026) — Vendor
Augment tested seven tools on the same five repositories as Greptile but expanded the ground truth dataset with manual verification.
| Tool | Precision | Recall | F1 |
|---|---|---|---|
| Augment Code Review | 65% | 55% | 59% |
| Cursor Bugbot | 60% | 41% | 49% |
| Greptile | 45% | 45% | 45% |
| Codex Code Review | 68% | 29% | 41% |
| CodeRabbit | 36% | 43% | 39% |
| Claude Code | 23% | 51% | 31% |
| GitHub Copilot | 20% | 34% | 25% |
Key insight: Precision and recall tell different stories. Claude Code had the second-highest recall (51%, it found a lot of bugs) but near-bottom precision (23%, most of its comments were false positives). Codex Code Review showed the inverse pattern: the highest precision (68%) with the lowest recall (29%). One tool is loud and thorough; the other is quiet, accurate, and misses most bugs.
Qodo Benchmark (2026) — Vendor
Qodo tested their multi-agent approach against eight tools across 100 PRs by injecting complex defects into real-world merged pull requests from active open-source repositories.
| Tool | F1 Score |
|---|---|
| Qodo (multi-agent) | 60.1% |
| Other tools | Not individually published |
Methodology: Qodo's injection approach inserts realistic bugs that simultaneously test code correctness and code quality. They designed defects to require understanding of the broader codebase, not just the diff — pushing tools toward deeper analysis.
CodeAnt AI Benchmark (2026) — Vendor
CodeAnt published results from a benchmark claiming 200,000 real pull requests — by far the largest dataset.
| Tool | F1 Score |
|---|---|
| CodeAnt AI | 51.7% |
| CodeRabbit | 51.2% |
| Others | Below 50% |
Scale claim: 200K PRs is dramatically larger than other benchmarks (50-100 PRs). However, the methodology details and raw data have not been published for independent verification.
Why every vendor wins their own benchmark
The pattern is consistent: Greptile tested Greptile and won. Augment tested Augment and won. Qodo tested Qodo and won. Three mechanisms explain this without assuming intentional manipulation:
1. The "caught" definition varies
Greptile counted a bug as caught when the tool identified the faulty code in a line-level comment. Augment counted precision and recall separately, penalizing noise. Martian used developer behavior — did someone actually fix the code? Three definitions, three different scores for the same tool on the same repository.
A tool that posts twenty comments per PR will score high on recall (it found the bug somewhere in those twenty comments) but low on precision (nineteen of those comments were noise). Depending on which metric you weight, the same tool looks excellent or mediocre.
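The tradeoff falls straight out of the metric definitions. A minimal sketch, using the twenty-comment tool above as input (the counts are illustrative):

```python
def review_metrics(true_positives: int, false_positives: int, missed_bugs: int):
    """Precision, recall, and F1 from raw comment counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + missed_bugs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The noisy tool above: 20 comments on one PR with one real bug.
# It caught the bug (1 true positive) and posted 19 noise comments.
p, r, f1 = review_metrics(true_positives=1, false_positives=19, missed_bugs=0)
print(f"precision={p:.0%} recall={r:.0%} f1={f1:.0%}")
# prints: precision=5% recall=100% f1=10%
```

Weight recall and the tool looks perfect; weight precision and it looks unusable. F1 splits the difference, which is why most benchmarks report all three.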
2. Bug selection bias
Each vendor selects or injects bugs that align with their tool's strengths. A tool optimized for security vulnerabilities will look brilliant on a bug set heavy with injection flaws — and mediocre on a set dominated by logic errors or race conditions. Nobody publishes a benchmark where they lose.
3. Small sample sizes amplify noise
Fifty PRs spread across five repos means ten PRs per repo. At that scale each PR is worth two percentage points, so a handful of ambiguous edge cases can swing a tool's score by double digits. Even so, sampling noise alone cannot explain Greptile's 82% versus Augment's 45% for the same tool: at 50 samples, the 95% confidence intervals for those two scores do not overlap. A gap that large points back to the bug selection and the definition of "caught."
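How wide those intervals are is easy to check with the standard normal-approximation formula for a proportion; a back-of-envelope sketch:

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# A 50%-ish score measured on 50 PRs vs 200 PRs:
for n in (50, 200):
    print(f"n={n}: 50% +/- {ci_half_width(0.5, n):.1%}")
```

At n = 50 the interval spans roughly ±14 points, which is why single-run scores on small benchmarks should be read as rough estimates, not rankings.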
The credibility spectrum
Not all benchmarks carry equal weight. Here is how to read them:
| Signal | More Credible | Less Credible |
|---|---|---|
| Who ran the test | Independent lab (Martian) | The vendor being tested |
| Raw data published | Yes (Martian, Augment) | No (some vendor claims) |
| Sample size | 100+ PRs (Qodo) | 50 PRs (Greptile, Augment) |
| Methodology | Open-source, reproducible | Described but not released |
| Scoring | Precision + recall + F1 | Single "catch rate" number |
| Multiple runs | Variance reported | Single run, no confidence interval |
Martian's benchmark currently sits at the top of this spectrum — independent researchers, open methodology, dual offline-online approach. But even their offline component used only 50 PRs.
What no benchmark has tested yet
Every published benchmark tests tools — CodeRabbit, Qodo, Greptile, Augment. None of them test the underlying models directly.
This matters because tools add layers on top of base models: custom prompts, retrieval-augmented generation, multi-agent workflows, post-processing filters. When CodeRabbit scores 51.2% F1, you don't know whether that reflects Claude's capability, CodeRabbit's prompt engineering, or their post-processing pipeline.
Questions that remain unanswered:
- Claude vs Gemini vs GPT on identical code review prompts — no head-to-head model comparison exists for the code review task specifically
- False positive rates by model — which model produces the most noise?
- Performance across 10+ languages — every benchmark uses the same five repos
- Cost-adjusted accuracy — which model gives you the most bugs per dollar?
- Hallucination rates — how often do models reference APIs, functions, or variables that don't exist?
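The last question lends itself to a cheap first-pass check that needs no model at all. A crude sketch, with hypothetical names throughout: flag any function call a review comment mentions that is defined nowhere in the repository.

```python
import re
from pathlib import Path

def called_names(comment: str) -> set[str]:
    """Identifiers the review comment refers to as calls, e.g. `parse_config()`."""
    return set(re.findall(r"\b([A-Za-z_]\w*)\s*\(", comment))

def defined_in_repo(repo_root: str, name: str) -> bool:
    """Crude grep for a Python definition of `name` anywhere in the repo."""
    pattern = re.compile(rf"\bdef\s+{re.escape(name)}\b")
    for path in Path(repo_root).rglob("*.py"):
        try:
            if pattern.search(path.read_text(errors="ignore")):
                return True
        except OSError:
            continue
    return False

def hallucinated_refs(repo_root: str, comment: str) -> set[str]:
    """Called names that appear nowhere in the repo: candidate hallucinations."""
    return {n for n in called_names(comment) if not defined_in_repo(repo_root, n)}
```

A real harness would resolve imports and builtins rather than grepping, but even this level of checking would surface the most blatant invented-API comments.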
What we're building: an open code review benchmark
We are constructing a benchmark designed to fill these gaps. The full methodology is published at BENCHMARK-METHODOLOGY.md and summarized here.
10 repositories, 10 languages
| Repo | Language | Stars | Domain |
|---|---|---|---|
| Sentry | Python | 40K+ | Error tracking |
| Cal.com | TypeScript | 33K+ | Scheduling |
| Grafana | Go | 65K+ | Observability |
| Keycloak | Java | 24K+ | Identity/Auth |
| Discourse | Ruby | 42K+ | Forum platform |
| Tokio | Rust | 28K+ | Async runtime |
| Folly | C++ | 28K+ | Performance library |
| Ktor | Kotlin | 13K+ | Web framework |
| Laravel | PHP | 80K+ | Web framework |
| Vapor | Swift | 24K+ | Server-side Swift |
We added Rust, C++, Kotlin, PHP, and Swift because current benchmarks only test Python, TypeScript, Go, Java, and Ruby. A model that excels on Python may struggle with Rust's ownership system or Swift's protocol-oriented patterns.
100 PRs with 150 injected bugs
Each repo contributes 10 PRs with 1-3 injected bugs per PR. Bug categories span five groups:
- Functional bugs (40%): off-by-one errors, null references, race conditions, resource leaks
- Security vulnerabilities (25%): mapped to CWE Top 25 — SQL injection, XSS, path traversal, SSRF, missing authorization
- Performance issues (15%): N+1 queries, unbounded collections, blocking calls in async contexts
- Code quality (15%): dead code, hardcoded secrets, missing input validation
- API misuse (5%): deprecated APIs, wrong argument ordering
Bug injection uses three methods: reversing real bug-fix commits (Greptile's approach), LLM-based synthetic injection with human validation (Qodo's approach), and reversing real CVE fixes (the approach DeepSource took with the OpenSSF CVE dataset).
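The first method can be sketched with plain git: revert the fix commit without committing, so the reverted diff becomes the ground-truth bug location. `inject_bug_from_fix` and its arguments are illustrative names, not part of any published harness.

```python
import subprocess

def inject_bug_from_fix(repo_dir: str, fix_sha: str, branch: str) -> None:
    """Re-introduce a historical bug by reverting its fix commit onto a test branch."""
    def git(*args: str) -> None:
        subprocess.run(["git", "-C", repo_dir, *args], check=True)

    git("checkout", "-b", branch)
    # Apply the inverse of the fix without committing, so the resulting
    # PR diff contains exactly the re-introduced bug and nothing else.
    git("revert", "--no-commit", fix_sha)
    git("commit", "-m", f"test: revert fix {fix_sha} (injected bug)")
```

The appeal of this method is that the bug is known to be real: a maintainer once shipped it and another maintainer once fixed it.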
5 models, head-to-head
| Model | Provider | Context Window |
|---|---|---|
| Claude Opus 4.6 | Anthropic | 200K |
| Claude Sonnet 4.6 | Anthropic | 200K |
| Gemini 2.5 Pro | Google | 1M |
| GPT-4.1 | OpenAI | 1M |
| GPT-o3 | OpenAI | 200K |
Every model receives the same system prompt, the same PR diff, and the same repository context. Temperature set to 0.1 for reproducibility. Each PR runs three times per model with majority vote scoring to reduce variance.
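The majority-vote step might look like the following sketch, which treats each run's findings as a set of (file, line) pairs and keeps only findings flagged in at least two of the three runs; the exact matching key is an assumption, not a published detail.

```python
from collections import Counter

def majority_vote(runs: list[set[tuple[str, int]]], threshold: int = 2) -> set[tuple[str, int]]:
    """Keep only findings flagged in at least `threshold` of the runs.

    Each run is the set of (file, line) findings from one model invocation.
    """
    counts = Counter(finding for run in runs for finding in run)
    return {finding for finding, n in counts.items() if n >= threshold}

# Three runs of the same model on the same PR (illustrative findings):
runs = [
    {("app.py", 42), ("db.py", 7)},
    {("app.py", 42)},
    {("app.py", 42), ("util.py", 3)},
]
print(majority_vote(runs))  # -> {('app.py', 42)}
```

One-off findings that appear in a single run are the ones most likely to be noise, which is what the threshold filters out.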
Scoring: LLM-as-judge with published rubric
Following Martian's approach, an LLM judge classifies each model comment against ground truth:
- Exact match (1.0): correct file, correct line range (±5 lines), correct bug category
- Partial match (0.5): correct file, general area (±20 lines), related category
- No match (0.0): wrong file or wrong location
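Applied mechanically, the rubric reduces to a small scoring function. This sketch simplifies "related category" to any category within the ±20-line window:

```python
def rubric_score(pred: tuple[str, int, str], truth: tuple[str, int, str]) -> float:
    """Score one model comment against one ground-truth bug.

    pred/truth are (file, line, category) tuples.
    """
    pred_file, pred_line, pred_cat = pred
    true_file, true_line, true_cat = truth
    if pred_file != true_file:
        return 0.0                      # no match: wrong file
    if abs(pred_line - true_line) <= 5 and pred_cat == true_cat:
        return 1.0                      # exact match: right line range and category
    if abs(pred_line - true_line) <= 20:
        return 0.5                      # partial match: right general area
    return 0.0                          # right file, wrong location

print(rubric_score(("auth.py", 118, "sqli"), ("auth.py", 120, "sqli")))  # -> 1.0
print(rubric_score(("auth.py", 135, "xss"),  ("auth.py", 120, "sqli")))  # -> 0.5
```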
We report precision, recall, F1, false positive rate, hallucination rate, cost per PR, and latency. All with 95% bootstrap confidence intervals.
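The percentile bootstrap behind those intervals is a few lines of stdlib Python; the per-PR scores below are illustrative:

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean per-PR score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

per_pr_f1 = [0.0, 0.5, 1.0, 0.5, 1.0] * 20   # illustrative scores for 100 PRs
lo, hi = bootstrap_ci(per_pr_f1)
print(f"mean={sum(per_pr_f1)/len(per_pr_f1):.2f}  95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting the interval rather than the point estimate is what makes claims like "model A beats model B" checkable.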
Everything open-source
The test suite, prompts, ground truth, raw API responses, and scoring code will be published under Apache 2.0. Anyone can reproduce our results, challenge our scoring, or add new models and repositories.
How to read benchmark numbers without getting misled
When you encounter AI code review benchmark claims — ours included — apply these filters:
Check who ran the test. If the vendor tested themselves, expect optimistic numbers. Look for independent evaluations or at least published raw data that others have verified.
Ask for precision AND recall. A single "accuracy" or "catch rate" number hides the precision-recall tradeoff. A tool with 80% recall and 20% precision catches most bugs but drowns you in false positives. A tool with 90% precision and 30% recall is quiet but misses most issues.
Look at the sample size. Fifty PRs is borderline. One hundred is adequate. Claims based on "thousands of PRs" without published methodology deserve skepticism — large numbers don't help if the scoring is opaque.
Check if results are reproducible. Can you run the same benchmark on your own code? If the test suite isn't published, the results are assertions, not evidence.
What practical accuracy means for your team
The raw F1 numbers — 45%, 51%, 60% — sound low. Here is what they mean in practice:
A tool with 55% F1 and 60% precision on a PR with 5 real issues will typically find 3 of them and add 2 false comments. A senior developer spends thirty seconds dismissing the false positives and saves fifteen minutes catching three bugs they might have missed.
The question isn't whether AI code review is perfect. The question is whether catching 3 out of 5 bugs automatically — even with some noise — is worth more than catching 0 out of 5 because nobody had time for a thorough review.
For teams that review 20+ PRs per week, even 50% recall with reasonable precision saves hours of review time. The math works at current accuracy levels. It works better when you run multiple models and compare results.
The same tool's score swings by up to 37 points depending on who tests it
We spent two days pulling every published benchmark into a single spreadsheet, and the result genuinely surprised us. The same tool, evaluated by different organizations on different bug sets, produces scores that barely overlap:
| Tool | Lowest Score | Highest Score | Range | Benchmarks |
|---|---|---|---|---|
| CodeRabbit | F1 36% (DeepSource) | F1 51.2% (Martian) | 15 points | 4 benchmarks |
| Greptile | F1 45% (Augment) | 82% catch (Greptile) | 37 points | 3 benchmarks |
| Cursor BugBot | F1 49% (Augment) | F1 80.5% (DeepSource) | 31 points | 2 benchmarks |
| Claude Code | F1 31% (Augment) | F1 62.4% (DeepSource) | 31 points | 2 benchmarks |
DeepSource tested on 165 real CVEs from the OpenSSF dataset — security vulnerabilities, not code quality issues. That single methodological choice flipped the leaderboard. Cursor BugBot went from middle-of-the-pack to first place. CodeRabbit dropped from respectable to last.
The takeaway is not that any benchmark is wrong. The takeaway is that a tool's score tells you how it performs on that specific test, not how it performs on your code.
What AI code review consistently misses
CodeRabbit did something unusual for a vendor — they published data that makes the entire category look bad. Their team analyzed 470 pull requests, split between 320 AI-co-authored and 150 human-only, and the pattern they found cuts against every marketing page in this space: AI catches the bugs that matter least and struggles with the bugs that cost you the most.
| Bug Category | AI vs Human | What This Means |
|---|---|---|
| Style and formatting | AI catches well | Lowest-impact issues — linters already handle these |
| Logic and correctness | 1.75x more errors in AI code | Misses domain-specific validation, edge cases |
| Concurrency bugs | ~2x more errors in AI code | Race conditions, deadlocks invisible in sequential tests |
| Security vulnerabilities (XSS) | 2.74x more in AI code | AI generates insecure XSS code 86% of the time |
| Architectural design flaws | 1.53x more in AI code | Privilege escalation paths, SOLID violations |
| Performance issues | 1.42x more in AI code | N+1 queries, connection pool leaks |
Source: CodeRabbit State of AI vs Human Code Generation Report, 470 PRs
A separate academic study tested GPT-4o and Gemini 2.0 Flash on code review specifically. GPT-4o achieved 68.5% correctness when given problem descriptions, but generated harmful suggestions — code changes that make things worse — 10.4% of the time. That number jumped to 23.8% when the model reviewed code without context about what to look for.
The practical risk is not just missed bugs. It is that AI sometimes introduces new ones through its suggestions.
47% of developers now use AI code review — but 96% don't trust it
Adoption doubled every year between 2023 and 2025, which is remarkable given that accuracy barely moved in the same period. The numbers tell a story about developer pragmatism over perfectionism:
| Source (Year) | AI Code Review Adoption |
|---|---|
| Stack Overflow (2023) | 11% |
| Stack Overflow (2024) | 22% |
| JetBrains DevEco (2025, 24,534 devs) | 44% |
| Stack Overflow (2025) | 47% |
| Jellyfish (Oct 2025) | 51.4% of teams |
The scale numbers are staggering when you line them up. GitHub reported 60 million Copilot code reviews since April 2025 — one in five pull requests on the entire platform now gets AI feedback before a human looks at it. Microsoft's internal engineering team runs AI review on 90% of PRs across 5,000 repositories, roughly 600,000 reviews per month. CodeRabbit just closed a $60 million Series B after reviewing 13 million PRs across 2 million repos. These are not experimental pilots — this is infrastructure.
The trust gap is where it gets interesting. SonarSource surveyed 1,100 developers and found that 42% of committed code is now AI-generated — projected to reach 65% by 2027. But 96% of developers do not fully trust that AI-generated code works correctly, and 38% say reviewing AI code takes more effort than reviewing human code.
LinearB's 2026 engineering benchmarks report put a number on what most reviewers already feel in their gut: AI-authored PRs carry 1.7x more issues than human PRs (10.83 vs 6.45 per PR) and get accepted at well under half the rate, 32.7% versus 84.4% for human code. Teams also wait 4.6x longer before picking up AI PRs, which suggests reviewers treat machine-generated changes with a level of suspicion that the accuracy data now justifies.
The paradox: more AI-generated code creates more need for AI-assisted review, because humans cannot keep up with the volume. But the AI doing the review misses the same categories of bugs that the AI writing the code introduces.
What a code review costs — AI vs human vs production
Nobody publishes this comparison, so we calculated it from current API pricing and industry salary data.
| Method | Cost per PR | Recall (est.) | Cost per bug found | Speed |
|---|---|---|---|---|
| Senior developer ($150/hr, 20 min) | $50.00 | ~90% | $37.04 | 20 min |
| Mid-level developer ($80/hr, 30 min) | $40.00 | ~70% | $38.10 | 30 min |
| Claude Opus 4.6 (API) | $0.045 | ~55% | $0.055 | ~30 sec |
| Claude Sonnet 4.6 (API) | $0.027 | ~55% | $0.033 | ~15 sec |
| GPT-4.1 (API) | $0.016 | ~55% | $0.020 | ~10 sec |
| Gemini 2.5 Pro (API) | $0.014 | ~55% | $0.017 | ~10 sec |
| Three models combined (Opus + GPT-4.1 + Gemini) | $0.075 | ~65% | $0.077 | ~30 sec |
Based on: median PR = 5,000 input tokens + 800 output tokens (arxiv research reports 3,937 avg tokens/PR). API pricing from official pages as of April 2026.
Running three models on every PR (Opus, GPT-4.1, and Gemini 2.5 Pro) costs seven and a half cents. A hundred PRs per month costs $7.50 in API fees. The same volume of human review at mid-level rates costs $4,000.
The cost argument for AI review has never been about replacing human reviewers. It is about running a $0.075 first pass that catches 55-65% of issues before a human spends twenty minutes on the ones that remain. The human reviewer focuses on architecture, business logic, and concurrency — the exact categories where AI performs worst.
A bug caught in code review costs roughly $37 to fix (30 minutes of developer time). The same bug found in production costs $1,125 on average — a 30x multiplier documented across multiple engineering organizations. One prevented production bug per month produces a 15,000% return on $7.50 in monthly API costs.
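The arithmetic behind those claims, using only the figures quoted in this section:

```python
# Figures quoted above.
api_cost_per_pr = 0.075            # three models on one PR
prs_per_month = 100
review_fix_cost = 37.0             # bug fixed during code review
production_fix_cost = 1125.0       # same bug found in production

monthly_api_cost = api_cost_per_pr * prs_per_month
multiplier = production_fix_cost / review_fix_cost
roi = production_fix_cost / monthly_api_cost   # return on one prevented bug

print(f"monthly API cost: ${monthly_api_cost:.2f}")   # -> $7.50
print(f"production multiplier: {multiplier:.0f}x")    # -> 30x
print(f"return: {roi:.0%}")                           # -> 15000%
```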
Frequently Asked Questions
Which AI code review tool has the highest accuracy in 2026?
Depends entirely on who ran the test. Qodo measured their own F1 at 60.1% across 100 PRs. Greptile measured themselves at 82% catch rate on 50 PRs. When Augment re-tested Greptile on the same repos, the score dropped to 45%. Every vendor wins their own benchmark — which is exactly the problem independent testing needs to solve.
What is a good F1 score for an AI code review tool?
Current tools hover around 50-60% F1 on independent benchmarks. Martian's evaluation put the best tools at 51-60% F1. Anything above 50% with false positive rate below 40% is competitive in April 2026. These numbers will climb as models improve throughout the year.
How do you benchmark AI code review tools fairly?
Three things matter: independent evaluation (not self-testing), standardized bug injection across identical PRs, and published methodology anyone can reproduce. Martian pioneered the dual offline-online approach. Our benchmark uses 100 PRs across 10 repos with LLM-as-judge scoring and all data on GitHub.
Does CodeRabbit catch 44% or 51% of bugs?
Both numbers come from legitimate tests. Greptile measured CodeRabbit at 44% catch rate on 50 PRs. Martian's independent benchmark measured 51.2% F1. The gap comes from different bug sets, different scoring definitions, and different repository selections.
Is Claude or GPT better for code review?
No head-to-head model comparison exists for code review specifically. SWE-bench tests code generation, not review. All existing benchmarks test tools rather than underlying models. We are building the first benchmark that tests Claude, Gemini, and GPT directly on the same 100 PRs with identical prompts.
What is the false positive rate for AI code review?
Augment's benchmark measured precision alongside recall. CodeRabbit showed 36% precision (64% false positive rate). Augment achieved 65% precision (35% false positives). Claude Code scored 23% precision (77% false positives). Below 40% false positives is the practical threshold where developers stop ignoring the tool.
How many PRs do you need for a reliable benchmark?
Greptile and Augment used 50 PRs — borderline for statistical significance. Qodo used 100. Our benchmark targets 100 PRs across 10 repos with 3 runs per model and bootstrap confidence intervals to ensure differences between models are statistically meaningful.
Can I run the benchmark myself?
The test suite will be open-source under Apache 2.0 on GitHub. It includes PR diffs, ground truth, scoring rubrics, and wrappers for Claude, Gemini, and GPT APIs. Reproduce our results or add your own models.
Sources and benchmark data
All benchmark data in this article comes from primary published sources:
Benchmarks:
- Martian Code Review Benchmark — withmartian.com/post/code-review-bench-v0 and GitHub
- Greptile AI Code Review Benchmarks — greptile.com/benchmarks
- Augment Code Review Benchmark — augmentcode.com/blog
- Qodo AI Code Review Benchmark — qodo.ai/blog
- CodeAnt AI Benchmark — codeant.ai/blogs
- DeepSource OpenSSF CVE Benchmark — deepsource.com/blog/ai-code-review-benchmarks
Adoption and industry data:
- GitHub Copilot: 60M Code Reviews — github.blog
- CodeRabbit Series B ($60M, 13M PRs) — coderabbit.ai/blog
- CodeRabbit AI vs Human Code Report — coderabbit.ai/blog
- SonarSource State of Code 2026 — sonarsource.com/blog
- LinearB 2026 Engineering Benchmarks — linearb.io/resources
- JetBrains Developer Ecosystem 2025 — devecosystem-2025.jetbrains.com
- Stack Overflow Developer Survey 2025 — survey.stackoverflow.co/2025/ai
- Microsoft AI Code Reviews at Scale — devblogs.microsoft.com
Academic research:
- Evaluating LLMs for Code Review (arXiv 2505.20206) — GPT-4o 68.5% accuracy, 10.4% harmful suggestion rate
- Automated Code Review In Practice (arXiv 2412.18531) — avg 3,937 tokens per PR review
Try it on your next PR
AI reviews your code for bugs, security issues, and logic errors. You approve what gets published.
Free: 10 AI reviews/day, 1 repo. No credit card.
Related Articles
AI PR Review in 2026: What Actually Works (And What Wastes Your Team's Time)
AI PR review tools compared: CodeRabbit, Copilot, Bugbot, Git AutoReview. Real stats from Microsoft (5,000 repos), Qodo (609 devs), and setup guides for GitHub, GitLab, Bitbucket.
Pull Request Template: Complete Guide for GitHub, GitLab & Bitbucket (2026)
Copy-paste PR templates for GitHub, GitLab, Bitbucket & Azure DevOps. Real examples from React, Angular, Next.js & Kubernetes. Setup, enforcement, and AI review integration.
GitHub Copilot Code Review 2026: 60M Reviews In — Is It Worth $10/Month?
GitHub Copilot hit 60 million code reviews. We break down how it works, what it catches, what it misses, real pricing math for teams, and when alternatives like Git AutoReview make more sense.
Get the AI Code Review Checklist
25 traps that slip through PR review — with code examples. Plus weekly code review tips.
Unsubscribe anytime. We respect your inbox.