AI-Assisted Code Review: What Works, What Doesn't
Two Years In
We rolled out AI-assisted code review to our engineering organisation roughly two years ago. I can now give you an honest assessment that is neither vendor marketing nor dismissive scepticism.
The short version: AI code review tools are genuinely useful for a narrow but high-value set of tasks, genuinely poor at another set, and quietly corrosive if you do not manage how they integrate with your human review process. The teams that got the most value were the ones that were deliberate about the division of labour. The teams that got burned were the ones that turned it on and hoped for the best.
This post is what I wish someone had written before we started.
The Review Taxonomy
To evaluate AI review tools honestly, you need to separate code review into categories. AI performs very differently across them.
- Security and mechanical Correctness (null checks, resource leaks, common anti-patterns): excellent
- Operational concerns: decent
- Design: poor
- Intent: essentially useless
The teams that were disappointed had expected AI to perform across all categories. The teams that were satisfied had targeted it at Security and Correctness and kept humans responsible for Design and Intent.
What AI Does Well
Security vulnerability detection
This is the strongest category. The model has been trained on millions of CVEs and security advisories. It pattern-matches against known vulnerability classes reliably.
// AI correctly flags this as SQL injection risk
public List<User> searchUsers(String query) {
    String sql = "SELECT * FROM users WHERE name LIKE '%" + query + "%'";
    return jdbcTemplate.query(sql, userRowMapper);
}
// AI suggests:
// SQL injection vulnerability. User input directly concatenated into query.
// Fix: use parameterized query with PreparedStatement
// OWASP A03:2021 – Injection

In our benchmarks against a corpus of known-vulnerable code, AI tools caught 83% of OWASP Top 10 instances, higher than the human reviewer baseline for the same corpus. Critically, AI catches these consistently regardless of reviewer experience or how late in the sprint the review happens.
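The fix the comment suggests, sketched here in Python with the stdlib sqlite3 driver (the original example is Java/JdbcTemplate; this is an illustrative translation, not the production code):

```python
import sqlite3

def search_users(conn: sqlite3.Connection, query: str):
    # Parameterized query: the driver handles escaping, so user input can
    # never terminate the string literal and inject SQL of its own.
    sql = "SELECT * FROM users WHERE name LIKE ?"
    return conn.execute(sql, (f"%{query}%",)).fetchall()
```

The user-supplied text travels as a bound parameter, never as SQL syntax, which is the whole point of the suggested fix.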
Hardcoded secrets and credentials
# AI flags immediately
DB_PASSWORD = "Sup3rS3cretP@ssw0rd!"
API_KEY = "sk-proj-abc123xyz"
# AI comment:
# Hardcoded credentials detected. This will be committed to version history.
# Move to environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault).

This class of finding is pure pattern matching. AI is faster and more reliable than humans for it.
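The remediation is small. A sketch of the fail-fast pattern (the helper name `load_secret` is illustrative, not a library API):

```python
import os

def load_secret(name: str) -> str:
    # Fail fast at startup if a required secret is missing, instead of
    # silently falling back to a hardcoded default.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"required secret {name!r} is not set")
    return value
```

Failing loudly at startup is deliberate: a hardcoded fallback would reintroduce exactly the finding the AI flagged.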
Resource leak detection
// AI catches the missing close()
public String readConfig(String path) throws IOException {
    FileReader reader = new FileReader(path); // Never closed
    BufferedReader br = new BufferedReader(reader);
    return br.readLine();
    // AI: FileReader not closed. Use try-with-resources.
}

What AI Does Poorly
Business logic correctness
AI does not know your domain. It cannot know whether the 15% discount rule applies to all users or only premium users unless it has that context. Even with context injection, it often gets domain rules wrong.
# AI sees no bug here
def calculate_refund(order, days_since_purchase):
    if days_since_purchase <= 30:
        return order.total * 0.9  # 10% restocking fee
    return 0
# But the product requirement is:
# Premium users get 100% refund within 30 days
# Regular users get 90% refund within 30 days
# AI has no way to know this rule exists

Business logic bugs require reading the ticket, understanding the domain model, and knowing the product requirements. These are human concerns.
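For contrast, here is what the function looks like once a human reviewer applies the stated requirement, assuming a hypothetical `order.user.is_premium` field on the domain model:

```python
def calculate_refund(order, days_since_purchase):
    # Encodes the product requirement the AI cannot see:
    # premium users get a full refund within 30 days,
    # regular users get 90% within 30 days.
    if days_since_purchase > 30:
        return 0
    if order.user.is_premium:  # assumed domain model, for illustration
        return order.total
    return order.total * 0.9
```

Nothing about the original code was wrong in isolation; the bug only exists relative to a requirement document the model never saw.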
Test quality assessment
This is the area where AI feedback is most misleading. AI will approve tests that pass but do not protect.
# AI: "Good test coverage for the calculate_discount function."
def test_calculate_discount():
    result = calculate_discount(100, 10)
    assert result is not None  # Passes. Means nothing.
# The meaningful test AI missed:
def test_calculate_discount_with_zero_price():
    with pytest.raises(ValueError):
        calculate_discount(0, 10)  # Should it raise? Return 0? AI doesn't know.

AI can count tests. It cannot judge whether the tests are actually testing the right things. A high-coverage test suite that only tests happy paths passes AI review with flying colours.
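A test that actually protects the function pins down exact values and the edge case. A sketch, assuming the team has decided a zero price should raise ValueError; the stand-in implementation exists only to make the example runnable:

```python
def calculate_discount(price, percent):
    # Minimal stand-in so the example runs; the real implementation
    # lives elsewhere in the codebase.
    if price <= 0:
        raise ValueError("price must be positive")
    return price * (100 - percent) / 100

# Asserts the actual arithmetic, not just non-None.
assert calculate_discount(100, 10) == 90

# Pins down the edge-case behaviour the weak test never exercised.
try:
    calculate_discount(0, 10)
    raise AssertionError("expected ValueError for zero price")
except ValueError:
    pass
```

The difference is not test count but assertion strength: the weak test passes for almost any implementation, including a broken one.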
Architectural impact
AI reviews one PR at a time. It does not have a mental model of how the codebase has evolved, what the intended architecture is, or how this change fits into the larger system trajectory.
A change that looks perfectly reasonable in isolation might be adding a new pattern to a module that was supposed to be deprecated, or duplicating logic that was recently centralised elsewhere. AI misses this entirely.
The False Positive Problem
This is the most practically damaging issue with AI code review: false positives erode trust.
A false positive is a comment that flags something as a bug or concern when there is nothing wrong. Early in our rollout, our AI tool was generating 4-6 false positive comments per PR. Within three months, engineers had started dismissing all AI comments without reading them.
To avoid this:
- Tune confidence thresholds: Only surface comments above a confidence threshold. Better to miss a few bugs than to flood reviewers with noise.
- Categorise by severity: CRITICAL and HIGH findings are blocking. MEDIUM and LOW are non-blocking suggestions. Engineers pay attention to blocking comments.
- Measure the false positive rate monthly: Track it. Set an SLA. If more than 15% of comments are false positives, the tool needs tuning.
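Taken together, the three points amount to a small triage filter over the tool's raw findings. A sketch with assumed field names, not any particular vendor's API:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str      # "CRITICAL" | "HIGH" | "MEDIUM" | "LOW"
    confidence: float  # model confidence, 0.0 to 1.0
    message: str

def triage(findings, min_confidence=0.8):
    # Drop low-confidence comments entirely, then split the remainder
    # into blocking (CRITICAL/HIGH) and non-blocking (MEDIUM/LOW) buckets.
    kept = [f for f in findings if f.confidence >= min_confidence]
    blocking = [f for f in kept if f.severity in ("CRITICAL", "HIGH")]
    suggestions = [f for f in kept if f.severity in ("MEDIUM", "LOW")]
    return blocking, suggestions
```

The `min_confidence` value of 0.8 is an arbitrary starting point; the monthly false positive measurement is what tells you where to actually set it.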
Integration Architecture
The pattern that works best is AI as a pre-screen before human review, not as a replacement.
The key insight: AI review should make human review better, not replace it. When AI has already caught the obvious issues, human reviewers spend their cognitive budget on the things that actually require a human.
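In pipeline terms, the pre-screen pattern looks roughly like this; the function names are illustrative, not a real integration API:

```python
def review_pipeline(pr, ai_review, request_human_review):
    # Stage 1: AI pre-screen. Blocking findings go back to the author
    # before any human spends time on the PR.
    blocking = [f for f in ai_review(pr)
                if f["severity"] in ("CRITICAL", "HIGH")]
    if blocking:
        return {"status": "changes_requested", "findings": blocking}
    # Stage 2: human review, with the mechanical issues already cleared,
    # focuses on design, intent, and test quality.
    return request_human_review(pr)
```

The ordering is the point: human reviewers only ever see PRs that have already passed the mechanical checks.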
Configuration that matters
# .github/ai-review.yml
rules:
  security:
    severity_threshold: HIGH      # Only post HIGH and CRITICAL security findings
    categories:
      - OWASP_INJECTION
      - HARDCODED_SECRETS
      - INSECURE_DESERIALIZATION
      - PATH_TRAVERSAL
  quality:
    severity_threshold: CRITICAL  # Only block on critical quality issues
suggestions:
  post_as: NON_BLOCKING           # Style and minor improvements, visible but not blocking
  max_per_pr: 5                   # Cap suggestions to prevent noise
exclude_paths:
  - "**/*.generated.java"         # Don't review generated code
  - "**/vendor/**"
  - "**/__tests__/**"             # Human reviews test quality

Less is more. Start with security findings only. Add more categories once you have calibrated the false positive rate.
Measuring Impact
Track these metrics from week one:
AI Code Review Health Dashboard
═══════════════════════════════════════════════
Signal quality
True positive rate: 87% (target: >80%)
False positive rate: 9% (target: <15%)
Engineer dismissal rate: 12% (rising → investigate)
Impact
Security bugs caught pre-merge: +83%
Review cycle time: -38%
Human review comments per PR: -29% (focused on intent)
Coverage
PRs with AI review: 98%
Findings acted on: 73%
Findings marked false +ve: 9%
Findings dismissed no reason: 18% ← investigate this
═══════════════════════════════════════════════

The "dismissed without reason" metric is the canary. When it rises, engineers are ignoring AI comments. When engineers ignore AI comments, you are paying for a tool that produces noise. Investigate and tune before trust collapses.
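The canary is cheap to compute if you record one event per AI finding. A sketch with an assumed event shape:

```python
def dismissal_no_reason_rate(events):
    # events: one dict per AI finding, e.g.
    # {"action": "dismissed", "reason": None} or {"action": "accepted"}.
    # Returns the share of all findings dismissed without a stated reason.
    dismissed = [e for e in events if e["action"] == "dismissed"]
    no_reason = [e for e in dismissed if not e.get("reason")]
    return len(no_reason) / len(events) if events else 0.0
```

Trending this weekly is enough: the absolute number matters less than whether it is rising.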
The Honest Summary
AI code review is a force multiplier for security and pattern-based quality concerns. It is not a replacement for human judgement on design, intent, and domain correctness. The teams that understand this distinction get real value. The teams that do not end up either over-relying on AI (and shipping logic bugs) or dismissing it entirely (and losing the security benefits).
The division of labour is simple: let AI handle the things it is reliably good at (security, resource management, obvious anti-patterns) and give human reviewers the space to focus on the things only humans can do (product correctness, architectural fit, test quality).
That combination is genuinely better than either alone.