I had a suspicion that I was using Claude Code wrong. I’d let it write code all day, then push the PR and wait for a teammate to find the obvious stuff. The bug I’d have caught myself if I’d read my own diff slower.
So I made a rule for a month. Every PR I opened got reviewed by Claude Code first. Before I asked a human. Before I even read it myself the second time. The code Claude helped write, reviewed by Claude. I wanted to know if that’s a real safety net or just a tool grading its own homework.
A month later I have opinions. Some of them surprised me.
The setup
Nothing fancy. I had a slash command that ran on the diff before I requested review.
Review the diff on this branch against main. Look for:
- Bugs that fail at runtime (null access, wrong types, bad async)
- Logic errors (wrong condition, off-by-one, missed edge case)
- Security issues (unvalidated input, leaked secrets, injection)
Ignore style, naming, and formatting. We have a linter for that.
For each issue: file, line, what breaks, and the smallest fix.
If the diff is clean, say "clean" and stop. Don't invent work.
That last line matters more than anything else in the prompt. I’ll come back to it.
I ran this on 41 PRs over the month. Real work, not toy branches. Some tiny, some 600-line monsters I should’ve split up.
What it actually caught
The wins were boring, which is exactly what you want from a reviewer.
It caught a Promise I forgot to await inside a map. The kind of thing that passes every test because the test finishes before the promise rejects, then blows up in production at 1% of traffic. I’ve shipped that bug before. It cost me a Saturday once.
It caught an env var I’d read as process.env.STRIPE_KEY in one file and process.env.STRIPE_SECRET_KEY in another. Both my code. Claude noticed they didn’t match. I would not have, not until the webhook silently failed.
It caught a SQL query where I’d built the string with a template literal and dropped a user-supplied value straight in. I know better. I was tired. The reviewer wasn’t.
Here’s the pattern I noticed: it’s best at the dumb, mechanical mistakes that come from being human at 6pm. The stuff that isn’t about being smart, it’s about being awake.
What it missed
This is the part nobody puts in the demo video.
It missed anything that needed context outside the diff. I changed a function’s return shape and updated three callers. There was a fourth caller in a different service that read from the same API. Claude couldn’t see it, so it said “clean.” That’s not a bug in the tool, it’s the limit of reviewing a diff in isolation. But if you think the green checkmark means “safe to merge,” it’ll bite you.
It missed a race condition that you could only see if you understood the order two webhooks arrive in. The code looked correct line by line. It was wrong as a system. That’s the kind of thing a senior teammate catches in five seconds and an AI reviewer, reading the text, does not.
And it never once questioned whether the feature should exist. A human reviewer sometimes says “wait, why are we doing this at all.” That question saved me more grief over the year than any line-level bug catch. Claude doesn’t ask it. It assumes the diff is the goal.
The false positives
Early on, maybe one in three findings was noise. “This could throw if x is null.” Sure, but x is a route param that Express guarantees is a string. Or “consider adding error handling here,” pointing at a line that already had a try/catch two functions up the stack.
The noise wasn’t useless, it was just expensive. Every false positive is a few seconds of me reading, thinking, and deciding it’s wrong. Do that ten times a PR and the reviewer is costing me more time than it saves.
Two things fixed most of it.
First, that “don’t invent work” line. Without it, an AI reviewer will always find something, because finding nothing feels like failing the task. Give it explicit permission to say “clean” and the noise drops hard.
Second, I started feeding it a tiny bit of context at the top. One sentence: “This is an internal admin endpoint, auth is handled by middleware, don’t flag missing auth checks.” That killed a whole category of confident, wrong findings.
The thing I didn’t expect
It made me read my own diffs better.
Knowing a reviewer was going to run, I started writing tighter PRs. Smaller. One concern each. Because a 600-line diff gives the reviewer too much surface and me too much output to wade through. The tool didn’t make my code better directly. It made me change the habit that made my code worse.
That’s a weird kind of value. Not “the AI caught the bug.” More like “knowing the AI would read it made me stop being sloppy.”
What I kept, what I dropped
I kept the review step. It pays for itself on the boring bugs alone, and those are the ones that wake me up at night.
I dropped the idea that it replaces a human. It doesn’t, and the month made that obvious. It’s a first pass that clears the dumb stuff so the human reviewer spends their attention on the design, the “why,” and the cross-system effects the diff can’t show.
I also stopped trusting “clean” as a verdict. “Clean” means “I found nothing in these lines.” It does not mean “this is correct.” Big difference. I treat it as a smoke detector, not a structural engineer.
The honest answer to “should Claude review its own PRs” is yes, with a clear head about what you’re getting. It’s a fast, tireless junior reviewer that’s great at catching the mistakes you make when you’re tired and blind to the ones that need context it can’t see. Use it for the first pass. Keep the human for the part that actually requires judgment. And never let a green checkmark talk you out of reading your own code.