
AI Made Your PRs 33% Bigger. Reviews Got Worse.
Median pull request size jumped from 57 to 76 lines between March and November 2025, a 33% increase in eight months. That sounds harmless until you map it against defect detection rates: reviewers catch 87% of bugs in PRs under 100 lines, but only 28% in PRs over 1,000 lines.
AI didn't just speed up writing. It quietly broke the math of code review.
The PR That Nobody Actually Read
You've seen the message in Slack. "Quick review please, mostly mechanical." Then you open it: 47 files, 1,400 additions, 600 deletions. Tests, configs, a "small refactor." The author wrote this in two hours with an AI assistant. You're expected to review it in twenty minutes.
So you do what every honest engineer does. You scroll fast, you skim the test names, you check that CI is green, and you click approve. You tell yourself the author knows the code best. You tell yourself the tests will catch anything serious.
The data says otherwise. Google's internal research and a decade of industry studies converge on the same line: review effectiveness collapses past 200 lines and falls off a cliff after 600. Teams that keep PRs under 400 lines report roughly 40% fewer production defects and review cycles that complete three times faster. The ideal change, by Graphite's measurement of merged PRs, is closer to 50 lines than 500.
This isn't taste. It's cognitive load. After 60 to 90 minutes of focused review, defect discovery rates drop sharply. A 2,000-line PR doesn't get a 2,000-line review. It gets a tired, distracted, pattern-matching skim. The reviewer rationalizes the gaps. The bug ships.
The Real Cost of "It's Mostly Mechanical"
The rationalizations all sound reasonable in isolation. "It's all renaming." "Most of it is generated." "The tests cover it." But mechanical changes are exactly where reviewers stop reading carefully, and that is where AI tools insert subtle behavior shifts: an off-by-one in a loop bound, a missing null check in an extracted helper, a regex that looks identical but matches one extra character.
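Here is the shape of that failure as a hypothetical sketch rather than a real incident: the helper and the data are invented, but the changed loop bound is exactly the kind of detail a skim at file 31 of 47 does not register.

```typescript
// Hypothetical "mechanical" extraction of an inline loop into a helper.
// The original loop used `i < durations.length`; the extracted version uses `<=`.
function sumDurationsMs(durations: number[]): number {
  let total = 0;
  for (let i = 0; i <= durations.length; i++) {
    // durations[durations.length] is undefined, so the final addition turns
    // the whole total into NaN at runtime, with no exception thrown.
    total += durations[i];
  }
  return total;
}
```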
Developers now spend 11.4 hours per week reviewing AI-generated code versus 9.8 hours writing it. Review is the new bottleneck, and bigger PRs make the bottleneck worse, not better. The feeling of speed at write time is paid for, with interest, at review time.
Worse, big PRs degrade everything downstream. They're harder to roll back when something breaks. They block the deploy pipeline longer. They bury the actual narrative of a change under noise. When you read the git log six months later, "WIP refactor" tells you nothing about why a critical line changed.
What a Good PR Looks Like
A reviewable PR has three properties. It is small enough to hold in one head. It tells one story. It can be reverted without taking unrelated work down with it.
Concretely:
- One logical change per PR. Refactoring lives in its own PR. Renames live in their own PR. New behavior lives in its own PR.
- Under 400 lines changed when possible, under 200 when reasonable. Anything over 600 needs a stronger justification than "it was easier this way." (A CI guard for this budget is sketched just after this list.)
- A description that explains the why, not the what. The diff shows the what. Tell the reviewer what to look at first and what to ignore.
- Tests that match the unit of change. New behavior gets new tests in the same PR. Don't promise tests in a follow-up.
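To keep the line budget from becoming a guideline everyone quietly ignores, a small CI check can flag oversize diffs before a human ever opens them. The sketch below is an assumption-laden example, not part of any tool mentioned here: the 400-line threshold, the base branch, and the script name are all placeholders to adapt.

```typescript
// check-pr-size.ts: a minimal PR size guard (hypothetical; adjust threshold and base branch).
import { execSync } from "node:child_process";

const MAX_CHANGED_LINES = 400; // the budget from the guideline above
const BASE = process.env.BASE_REF ?? "origin/main"; // assumed base branch

// --shortstat prints something like " 12 files changed, 310 insertions(+), 95 deletions(-)"
const stat = execSync(`git diff --shortstat ${BASE}...HEAD`, { encoding: "utf8" });
const insertions = Number(/(\d+) insertion/.exec(stat)?.[1] ?? 0);
const deletions = Number(/(\d+) deletion/.exec(stat)?.[1] ?? 0);
const changed = insertions + deletions;

if (changed > MAX_CHANGED_LINES) {
  console.error(`This PR changes ${changed} lines (budget: ${MAX_CHANGED_LINES}). Consider splitting it.`);
  process.exit(1);
}
console.log(`This PR changes ${changed} lines, within the ${MAX_CHANGED_LINES}-line budget.`);
```

Run it as a warning before you make it a gate; the goal is a visible nudge toward splitting, not a number people learn to game.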
Splitting a big change into a stack of small PRs feels like more work for the author. It is. It's also what makes the difference between a review that catches the SQL injection and one that nods through it.
When the PR Is Big Anyway
Sometimes the change really is large. A migration, a framework upgrade, a vendored dependency. The review still has to happen, and the reviewer still has finite attention. This is where automated review earns its keep, not by replacing the human, but by triaging the wall of diff before the human looks at it.
Octopus Review indexes your entire codebase with RAG-powered vector search, so when it opens a 1,200-line PR, it already knows what the rest of your code looks like. It sees that a function being introduced here duplicates one that already exists three modules over. It sees that the new error handler swallows an exception type that another service relies on. The diff alone can't show that. The codebase context can.
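To picture why the codebase context matters, here is a hypothetical pair of files (neither the paths nor the validators come from a real review): the diff contains only the second one, and it looks perfectly reasonable on its own.

```typescript
// src/billing/validate.ts (already in the codebase, untouched, so it never appears in the diff)
export function isValidCurrencyCode(code: string): boolean {
  return /^[A-Z]{3}$/.test(code);
}

// src/orders/helpers.ts (added by the PR)
// A near-duplicate of the validator above, differing only by a trim().
// Diff-only review sees a tidy new utility; codebase-aware review sees the second copy,
// and the two will drift further apart the first time one of them gets patched.
export function checkCurrency(code: string): boolean {
  return /^[A-Z]{3}$/.test(code.trim());
}
```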
Then it labels what it finds across five severity levels: Critical, Major, Minor, Suggestion, and Tip. A reviewer staring down 47 files doesn't read every comment. They read the Critical and Major ones, then they spot-check the rest. Calibrated severity is what makes a 1,200-line PR triageable instead of skimmable.
A typical output on a real PR looks like this:
🔴 **Critical**: SQL injection via unsanitized user input
`src/api/users.ts:47`
The `userId` parameter is interpolated directly into the SQL query without parameterization. Use a prepared statement instead.
🟡 **Minor**: Unused import
`src/utils/helpers.ts:3`
`lodash` is imported but never referenced in this file.
💡 **Tip**: Consider extracting shared validation
`src/api/orders.ts:112`
This validation logic duplicates what's in `src/api/users.ts:89`. A shared validator would reduce maintenance surface.
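That Critical finding is the one worth spelling out. Assuming a node-postgres style client (the queries below are illustrative, not the code the comment was actually posted on), the fix is to move the value out of the SQL text and into a bound parameter:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings assumed to come from the environment

// Vulnerable: userId is interpolated straight into the SQL string.
async function getUserUnsafe(userId: string) {
  return pool.query(`SELECT * FROM users WHERE id = '${userId}'`);
}

// Fixed: the value travels as a bound parameter and is never parsed as SQL.
async function getUser(userId: string) {
  return pool.query("SELECT * FROM users WHERE id = $1", [userId]);
}
```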
Trigger one against a real PR with a single command:
`octopus review 42`
That's it. The review runs against the indexed codebase, comments post inline, and the reviewer walks into the PR with the critical issues already surfaced. Their human attention goes to architecture, intent, and trade-offs, not to grep-by-eye.
The Discipline That Compounds
Small PRs aren't a stylistic preference. They are how you keep review effectiveness high while AI keeps making it cheap to write more code. The teams that stay fast in 2026 are the ones that pair tight PR discipline with codebase-aware review tooling. The teams that don't are watching defect rates climb while their dashboards still say "approved."
Try Octopus Review on your next big PR at octopus-review.ai. Star the repo on GitHub if context-aware code review is the kind of thing you want more of, and come argue about ideal PR size with us in Discord.