
Building an AI Code Review Tool: Architecture and Lessons Learned
Code reviews are one of those things every team agrees are important but nobody enjoys waiting for. You open a pull request, your reviewer is heads-down on something else, and the PR sits there. When the review does come, the quality depends on who reviewed it, how busy they were, and whether it was 10am or 5pm.
I've been thinking about this problem for a while. Last year I started building Octopus, a tool that uses LLMs and vector search to automatically review pull requests and post findings as inline comments. It's not meant to replace human reviewers, but to handle the repetitive stuff so humans can focus on architecture and design decisions.
This post is a walkthrough of how it works under the hood, what I built, what was harder than expected, and what I'd do differently.
Why I Started Building This
The problems with code reviews are well-known:
- Turnaround time. PRs waiting for review block other work. The longer they sit, the more context everyone loses.
- Inconsistency. Different reviewers catch different things. Standards drift over time.
- Scaling issues. As teams grow, senior developers become review bottlenecks.
I didn't set out to "solve" code review. That's too broad. I wanted to build something that could catch common issues consistently and give developers faster feedback on their PRs.
The Architecture: From Webhook to Review Comment
The core pipeline is surprisingly straightforward. Here's what happens when you open a PR:
GitHub/Bitbucket Webhook
→ Fetch PR diff
→ Search codebase for relevant context (vector search)
→ Send diff + context to LLM
→ Parse and filter findings
→ Post inline comments on the PR
Let me break down each step.
Step 1: Catching the Webhook
When a pull request is opened or updated, GitHub sends a webhook to Octopus. The handler verifies the HMAC-SHA256 signature, extracts the diff, and kicks off the review pipeline.
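The post doesn't show Octopus's actual handler, but GitHub's signature scheme is standard: an `X-Hub-Signature-256` header carrying an HMAC-SHA256 hex digest of the raw body. A minimal Node verifier might look like this:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify GitHub's X-Hub-Signature-256 header against the raw request body.
// GitHub sends "sha256=<hex digest>", computed over the body with the
// webhook's shared secret.
function verifySignature(secret: string, rawBody: string, signatureHeader: string): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so check lengths first;
  // constant-time comparison prevents timing attacks on the digest.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

One detail that bites people: the HMAC must be computed over the *raw* request body, before any JSON parsing, or the digest won't match.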
I also handle issue_comment events so that when someone comments @octopus review this, the system triggers a manual review. This turned out to be one of the most-used features. Teams love being able to re-trigger reviews after making changes.
Step 2: Indexing the Codebase
Before the AI can review code intelligently, it needs context from the surrounding codebase, not just the diff. This is where things get interesting.

When a repository connects to Octopus for the first time, we clone it and split the code into chunks of 1,500 characters with 200-character overlaps. Each chunk gets embedded using OpenAI's text-embedding-3-large model (3,072 dimensions) and stored in Qdrant, a vector database.
Why 1,500 characters? Too small and you lose context. Too large and the embeddings become noisy. After experimenting with different sizes, 1,500 with 200-char overlap gave the best retrieval accuracy for code.
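The chunking itself is simple sliding-window splitting. A sketch with the sizes from the post (the real indexer presumably also tracks file paths and byte offsets for each chunk):

```typescript
// Split source text into fixed-size chunks with overlap, so code that spans
// a chunk boundary still appears intact in at least one chunk.
// Sizes mirror the post: 1,500-character chunks, 200-character overlap.
function chunkCode(text: string, size = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // advance 1,300 characters per chunk
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```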
We store five types of vector collections:
- code_chunks: the indexed repository code
- knowledge_chunks: organization-wide knowledge base
- review_chunks: past reviews (for learning)
- chat_chunks: conversation history
- diagram_chunks: architecture diagrams
Step 3: Hybrid Search for Context
When a PR comes in, we take the first 8,000 characters of the diff and embed them as a query vector. Then we run a hybrid search against Qdrant, combining dense vectors (semantic similarity) with sparse vectors (BM25-like keyword matching) using Reciprocal Rank Fusion.
The sparse vector generation was a fun challenge. I implemented a custom tokenizer that splits camelCase and snake_case identifiers, filters stop words, and uses FNV-1a hashing across a 262K hash space with log-normalized TF scoring.
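Here's my sketch of that sparse-vector pipeline, reconstructed from the description above (the stop-word list and regexes are illustrative, not the actual Octopus code):

```typescript
// Sparse vector generation: split identifiers, drop stop words, hash each
// token with FNV-1a into a 262,144-slot (2^18) space, weight by
// log-normalized term frequency.
const STOP_WORDS = new Set(["the", "a", "an", "of", "in", "to", "and", "or", "is"]);

function tokenize(code: string): string[] {
  return code
    .split(/[^A-Za-z0-9]+/)                               // break on punctuation/whitespace
    .flatMap((w) => w.split(/(?<=[a-z0-9])(?=[A-Z])|_/))  // camelCase and snake_case
    .map((t) => t.toLowerCase())
    .filter((t) => t.length > 1 && !STOP_WORDS.has(t));
}

// 32-bit FNV-1a, folded into the 2^18 hash space.
function fnv1a(token: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    h ^= token.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % 262144;
}

// Map of hash index -> log-normalized TF weight (1 + ln(count)).
function sparseVector(code: string): Map<number, number> {
  const tf = new Map<number, number>();
  for (const t of tokenize(code)) {
    const idx = fnv1a(t);
    tf.set(idx, (tf.get(idx) ?? 0) + 1);
  }
  return new Map([...tf].map(([i, c]) => [i, 1 + Math.log(c)]));
}
```

Splitting `getUserById` into `get`, `user`, `by`, `id` is what lets a keyword search on a diff mentioning `userId` find the relevant accessor, which plain whitespace tokenization would miss.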
After the initial search returns ~50 results, we rerank them using Cohere's rerank-v3.5 model. This step is crucial. It takes the rough results from vector search and re-scores them based on actual relevance to the diff. We keep the top 15 code chunks and top 8 knowledge chunks.
This two-stage retrieval (search then rerank) made a noticeable difference in context quality compared to vector search alone.
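For reference, Reciprocal Rank Fusion itself is a few lines. The constant `k = 60` below is the conventional default from the RRF literature, not necessarily what Octopus (or Qdrant) uses internally:

```typescript
// Reciprocal Rank Fusion: merge ranked result lists (e.g. dense and sparse
// search) into one. A document's fused score is the sum of 1/(k + rank)
// over every list it appears in; k damps the advantage of a single #1 hit,
// so documents that rank decently everywhere beat one-list wonders.
function rrf(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1)); // rank is 1-based
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```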
Step 4: The LLM Review
This is the core step. We send the diff, the retrieved code context, the repo's file tree, and the organization's knowledge base to the LLM.
Octopus supports multiple AI providers:
- Anthropic Claude (default, Claude Sonnet 4)
- OpenAI (GPT-4o, o1, o3)
- Google Gemini
Organizations can configure which model to use at the org or repo level. We also support bringing your own API keys, so if you want to use your own Anthropic account, you can.
One optimization worth mentioning: prompt caching. Claude supports caching the system prompt across requests. Since our system prompt (which includes the codebase context) is large but changes infrequently per repo, caching saves a lot of tokens on repeated reviews.
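Concretely, Anthropic's prompt caching works by marking system blocks with `cache_control`. A sketch of how such a request body is shaped (field names follow Anthropic's public Messages API; the prompt text and model pin here are illustrative):

```typescript
// Build a Messages API request where the large, per-repo system block is
// marked ephemeral. Subsequent reviews of the same repo read that block from
// cache instead of paying full input-token price for it every time.
function buildReviewRequest(repoContext: string, diff: string) {
  return {
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    system: [
      {
        type: "text",
        text: `You are a code reviewer.\n\nCodebase context:\n${repoContext}`,
        cache_control: { type: "ephemeral" }, // cache boundary: everything up to here
      },
    ],
    // Only the per-PR part (the diff) changes between requests.
    messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
  };
}
```

The key design point: put the stable, bulky content (codebase context, instructions) *before* the cache marker and the per-request content (the diff) after it, so the cache actually hits.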
Step 5: Parsing and Filtering Findings
The LLM returns findings with severity levels:
- 🔴 Critical: Security vulnerabilities, data loss risks
- 🟠 High: Bugs that will cause issues in production
- 🟡 Medium: Code quality issues, potential problems
- 🔵 Low: Minor improvements, style suggestions
- 💡 Info: Educational notes, best practices
But raw LLM output needs filtering. We apply several layers:
- Confidence threshold: only keep HIGH and MEDIUM confidence findings
- Disabled categories: respect org/repo preferences
- Semantic feedback matching: if a user previously dismissed a similar finding (👎), suppress it. This uses embedding similarity with a 0.85 threshold
- Optional two-pass validation: a second LLM call confirms each finding against the actual diff
- Cap at 30 findings, prioritized by severity
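The deterministic parts of that filter chain can be sketched as a pure function (type and field names are mine; the semantic 👎-matching and two-pass validation are omitted because both need embedding/LLM calls):

```typescript
type Severity = "critical" | "high" | "medium" | "low" | "info";
type Confidence = "high" | "medium" | "low";

interface Finding {
  severity: Severity;
  confidence: Confidence;
  category: string;
  message: string;
}

// Lower rank = more severe, so an ascending sort puts critical first.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0, high: 1, medium: 2, low: 3, info: 4,
};

function filterFindings(
  findings: Finding[],
  disabledCategories: Set<string>,
  maxFindings = 30,
): Finding[] {
  return findings
    .filter((f) => f.confidence !== "low")                // confidence threshold
    .filter((f) => !disabledCategories.has(f.category))   // org/repo preferences
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity])
    .slice(0, maxFindings);                               // cap, most severe first
}
```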
The feedback learning loop is one of my favorite features. When developers react to findings with 👍 or 👎, Octopus learns. Over time, false positives decrease because the system remembers what your team considers noise.
Step 6: Posting Results
Findings are posted as inline PR comments using GitHub's Review API. Each comment appears exactly on the relevant line, just like a human review. We also post a summary table at the top of the review.
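For context, GitHub's endpoint for this is `POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews`, which takes one review body plus an array of positioned comments. A sketch of the payload (the actual HTTP call, via Octokit or `fetch`, is omitted):

```typescript
// One inline comment, anchored to a line on the new side of the diff.
interface InlineComment {
  path: string;  // file the comment attaches to, relative to repo root
  line: number;  // line number in the file's new version
  side: "RIGHT"; // comment on the added/changed side of the diff
  body: string;  // markdown comment text
}

// Assemble the review payload: summary table in `body`, findings as
// positioned `comments`, all submitted as a single review.
function buildReviewPayload(summary: string, comments: InlineComment[]) {
  return {
    event: "COMMENT" as const, // post without approving or requesting changes
    body: summary,
    comments,
  };
}
```

Posting everything as one review (rather than individual comment API calls) means the developer gets a single notification instead of thirty.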
For Bitbucket, the same pipeline works through a provider abstraction layer. One codebase, two platforms.
The Tech Stack
Here's what powers Octopus:
| Layer | Technology |
|---|---|
| Framework | Next.js 16 (App Router) |
| Language | TypeScript (strict mode) |
| Runtime | Node.js + Bun |
| Database | PostgreSQL + Prisma ORM |
| Vector DB | Qdrant (self-hosted) |
| Auth | Better Auth (Google, GitHub OAuth, magic link) |
| AI | Anthropic Claude, OpenAI, Google Gemini |
| Reranking | Cohere rerank-v3.5 |
| Real-time | Pubby (WebSocket) |
| Payments | Stripe |
| UI | shadcn/ui + Radix + Tailwind CSS 4 |
| Monorepo | Turborepo |
Why This Stack?
Next.js 16 with App Router gave me server components, server actions, and API routes in one place. For a tool that needs both a dashboard UI and webhook endpoints, this was perfect.
Qdrant over Pinecone because I wanted self-hosting capability. Enterprise customers with strict security requirements need their code to stay on their infrastructure. Qdrant's hybrid search (dense + sparse vectors) was also a differentiator.
Better Auth over NextAuth because it has first-class Prisma support and a plugin system that made adding features like magic links straightforward.
Pubby for real-time because reviews aren't instant. They take 15-60 seconds. Users need to see progress. We push real-time status updates (fetching-diff → searching-context → generating-review → posting-comment) through organization-scoped WebSocket channels so the dashboard shows live progress.
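The status events themselves are small. A sketch of what gets pushed per phase (Pubby's actual publish API isn't shown in the post, so this only models the event shape and the org-scoped channel naming):

```typescript
// The pipeline's status phases, published in order as the review progresses.
type ReviewStatus =
  | "fetching-diff"
  | "searching-context"
  | "generating-review"
  | "posting-comment";

// Build the event for one phase transition. Scoping the channel to the
// organization means every dashboard session in that org sees the update,
// and no one outside it does.
function statusEvent(orgId: string, reviewId: string, status: ReviewStatus) {
  return {
    channel: `org:${orgId}`,
    event: "review-status",
    payload: { reviewId, status, at: new Date().toISOString() },
  };
}
```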
Hard Problems I Had to Solve
The Context Window Problem
An LLM can't review code well if it only sees the diff. It needs to understand how the changed code relates to the rest of the codebase. But you can't stuff an entire repository into a context window.
The solution was the hybrid search + reranking pipeline I described above. Instead of sending everything, we send the most relevant parts of the codebase. The AI gets enough context to understand the changes without hitting token limits.
False Positive Management
Early versions of Octopus were noisy. The AI would flag things that weren't actually problems, and developers would start ignoring all findings. This is the worst possible outcome for a review tool.
The feedback loop solved this. But I also added:
- Semantic deduplication: don't post the same finding twice on re-reviews
- Re-review mode: on follow-up reviews, only show critical findings
- Configurable severity thresholds: teams can choose which severity levels get inline comments vs. go into the summary
Cost Management
LLM calls are expensive. A single review can cost anywhere from $0.05 to $0.50 depending on the diff size and model. At scale, this adds up fast.
I built a cost tracking system that logs every token (input, output, cached reads, cached writes) per organization, repository, and operation type. Organizations have configurable monthly spend limits (default $150/month). Before every expensive operation, the system checks isOrgOverSpendLimit().
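A minimal sketch of that accounting (the post names isOrgOverSpendLimit() but not its internals, so the structure and the per-million-token prices below are illustrative; real prices vary by model and provider):

```typescript
// Token counts for one LLM operation, broken down the way providers bill them.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;  // cheap: cached prompt reads
  cacheWriteTokens: number; // slight premium over plain input
}

// Illustrative USD prices per million tokens.
const PRICE_PER_M = { input: 3, output: 15, cacheRead: 0.3, cacheWrite: 3.75 };

function usageCostUsd(u: TokenUsage): number {
  return (
    (u.inputTokens * PRICE_PER_M.input +
      u.outputTokens * PRICE_PER_M.output +
      u.cacheReadTokens * PRICE_PER_M.cacheRead +
      u.cacheWriteTokens * PRICE_PER_M.cacheWrite) / 1_000_000
  );
}

// Gate expensive operations on the org's accumulated month-to-date spend.
function isOrgOverSpendLimit(monthToDateUsd: number, limitUsd = 150): boolean {
  return monthToDateUsd >= limitUsd;
}
```

Note the 10x gap between cache reads and regular input in this price sketch; that asymmetry is why the prompt-caching optimization pays off.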
Prompt caching on Claude helped a lot here. The system prompt with codebase context gets cached, so repeated reviews on the same repo reuse the cached prompt and only pay for the diff portion.
Build Artifact Detection
You'd be surprised how many PRs accidentally include node_modules/, dist/ folders, or package-lock.json changes. Octopus automatically detects these with regex patterns and flags them as critical findings before running the full review. Simple but saves teams a lot of headache.
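The detection can be as simple as matching the PR's changed-file paths against a pattern list (these patterns are my guess at the rules; the post doesn't list the actual regexes):

```typescript
// Paths that usually mean a build artifact was committed by accident.
// Lockfile changes are matched too so a large lockfile diff gets flagged
// for a human look, per the post.
const ARTIFACT_PATTERNS: RegExp[] = [
  /(^|\/)node_modules\//,
  /(^|\/)dist\//,
  /(^|\/)build\//,
  /\.min\.(js|css)$/,
  /(^|\/)package-lock\.json$/,
];

function findArtifacts(changedFiles: string[]): string[] {
  return changedFiles.filter((f) => ARTIFACT_PATTERNS.some((p) => p.test(f)));
}
```

Running this cheap check before the LLM review also avoids wasting tokens on a 40,000-line vendored-dependency diff.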
What I Learned
1. Retrieval Quality > Model Quality
Switching from GPT-4 to Claude improved review quality. But adding Cohere reranking on top of vector search improved it even more. The context you feed the LLM matters at least as much as which LLM you pick, probably more.
2. Feedback Loops Are Everything
The 👍/👎 system seemed like a nice-to-have when I built it. It turned out to be critical for long-term adoption. Without it, false positives pile up and people start ignoring the tool entirely. With it, the system gradually learns what each team actually cares about.
3. Show Progress, Not Silence
The first version had no progress indicators. Users would trigger a review and stare at a blank screen wondering if it was broken. Adding WebSocket-based status updates (fetching diff... searching context... generating review...) was a small change that made a huge difference in how the tool felt to use.
4. Abstract Early When You Know You'll Need It
Supporting both GitHub and Bitbucket, and multiple AI providers, forced clean abstractions early. It was more work upfront but made adding new providers straightforward later.
What's Next
I'm working on deeper IDE integrations, better conflict detection for monorepos, and expanding the knowledge base so teams can teach Octopus their domain-specific rules.
Octopus is open source and available on GitHub. It integrates with GitHub and Bitbucket. You can also self-host it if your code can't leave your infrastructure. There's a free tier for smaller teams.
I'd love to hear how other people are approaching AI-assisted code review. Feel free to drop a comment or reach out on X.