
Building an AI Code Review Tool: Architecture and Lessons Learned
Code reviews are one of those things every team agrees are important but nobody enjoys waiting for. You open a pull request, your reviewer is heads-down on something else, and the PR sits there. When the review does come, the quality depends on who reviewed it, how busy they were, and whether it was 10am or 5pm.
I've been thinking about this problem for a while. Last year I started building Octopus, a tool that uses LLMs and vector search to automatically review pull requests and post findings as inline comments. It's not meant to replace human reviewers, but to handle the repetitive stuff so humans can focus on architecture and design decisions.
This post is a walkthrough of how it works under the hood, what I built, what was harder than expected, and what I'd do differently.
Why I Started Building This
The problems with code reviews are well-known:
- Turnaround time. PRs waiting for review block other work. The longer they sit, the more context everyone loses.
- Inconsistency. Different reviewers catch different things. Standards drift over time.
- Scaling issues. As teams grow, senior developers become review bottlenecks.
I didn't set out to "solve" code review. That's too broad. I wanted to build something that could catch common issues consistently and give developers faster feedback on their PRs.
The Architecture: From Webhook to Review Comment
The core pipeline is surprisingly straightforward. Here's what happens when you open a PR:
GitHub/Bitbucket Webhook
→ Fetch PR diff
→ Search codebase for relevant context (vector search)
→ Send diff + context to LLM
→ Parse and filter findings
→ Post inline comments on the PR
Let me break down each step.
Step 1: Catching the Webhook
When a pull request is opened or updated, GitHub sends a webhook to Octopus. The handler verifies the HMAC-SHA256 signature, extracts the diff, and kicks off the review pipeline.
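The post doesn't show Octopus's actual handler, but GitHub's signature scheme is standard: an `X-Hub-Signature-256` header carrying an HMAC-SHA256 hex digest of the raw body. A minimal Node verifier might look like this:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify GitHub's X-Hub-Signature-256 header against the raw request body.
// GitHub sends "sha256=<hex digest>", computed over the body with the
// webhook's shared secret.
function verifySignature(secret: string, rawBody: string, signatureHeader: string): boolean {
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so check lengths first;
  // constant-time comparison prevents timing attacks on the digest.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

One detail that bites people: the HMAC must be computed over the *raw* request body, before any JSON parsing, or the digest won't match.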
I also handle issue_comment events so that when someone comments @octopus review this, the system triggers a manual review. This turned out to be one of the most-used features. Teams love being able to re-trigger reviews after making changes.
Step 2: Indexing the Codebase
Before the AI can review code intelligently, it needs context from the surrounding codebase, not just the diff. This is where things get interesting.

When a repository connects to Octopus for the first time, we clone it and split the code into chunks of 1,500 characters with 200-character overlaps. Each chunk gets embedded using OpenAI's text-embedding-3-large model (3,072 dimensions) and stored in Qdrant, a vector database.
Why 1,500 characters? Too small and you lose context. Too large and the embeddings become noisy. After experimenting with different sizes, 1,500 with 200-char overlap gave the best retrieval accuracy for code.
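The chunking itself is simple sliding-window splitting. A sketch with the sizes from the post (the real indexer presumably also tracks file paths and byte offsets for each chunk):

```typescript
// Split source text into fixed-size chunks with overlap, so code that spans
// a chunk boundary still appears intact in at least one chunk.
// Sizes mirror the post: 1,500-character chunks, 200-character overlap.
function chunkCode(text: string, size = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // advance 1,300 characters per chunk
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```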
We store five types of vector collections:
- code_chunks: the indexed repository code
- knowledge_chunks: organization-wide knowledge base
- review_chunks: past reviews (for learning)
- chat_chunks: conversation history
- diagram_chunks: architecture diagrams
Step 3: Hybrid Search for Context
When a PR comes in, we take the first 8,000 characters of the diff and embed them as a query vector. Then we run a hybrid search against Qdrant, combining dense vectors (semantic similarity) with sparse vectors (BM25-like keyword matching) using Reciprocal Rank Fusion.
The sparse vector generation was a fun challenge. I implemented a custom tokenizer that splits camelCase and snake_case identifiers, filters stop words, and uses FNV-1a hashing across a 262K hash space with log-normalized TF scoring.
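Here's my sketch of that sparse-vector pipeline, reconstructed from the description above (the stop-word list and regexes are illustrative, not the actual Octopus code):

```typescript
// Sparse vector generation: split identifiers, drop stop words, hash each
// token with FNV-1a into a 262,144-slot (2^18) space, weight by
// log-normalized term frequency.
const STOP_WORDS = new Set(["the", "a", "an", "of", "in", "to", "and", "or", "is"]);

function tokenize(code: string): string[] {
  return code
    .split(/[^A-Za-z0-9]+/)                               // break on punctuation/whitespace
    .flatMap((w) => w.split(/(?<=[a-z0-9])(?=[A-Z])|_/))  // camelCase and snake_case
    .map((t) => t.toLowerCase())
    .filter((t) => t.length > 1 && !STOP_WORDS.has(t));
}

// 32-bit FNV-1a, folded into the 2^18 hash space.
function fnv1a(token: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    h ^= token.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % 262144;
}

// Map of hash index -> log-normalized TF weight (1 + ln(count)).
function sparseVector(code: string): Map<number, number> {
  const tf = new Map<number, number>();
  for (const t of tokenize(code)) {
    const idx = fnv1a(t);
    tf.set(idx, (tf.get(idx) ?? 0) + 1);
  }
  return new Map([...tf].map(([i, c]) => [i, 1 + Math.log(c)]));
}
```

Splitting `getUserById` into `get`, `user`, `by`, `id` is what lets a keyword search on a diff mentioning `userId` find the relevant accessor, which plain whitespace tokenization would miss.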
After the initial search returns ~50 results, we rerank them using Cohere's rerank-v3.5 model. This step is crucial. It takes the rough results from vector search and re-scores them based on actual relevance to the diff. We keep the top 15 code chunks and top 8 knowledge chunks.
This two-stage retrieval (search then rerank) made a noticeable difference in context quality compared to vector search alone.
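For reference, Reciprocal Rank Fusion itself is a few lines. The constant `k = 60` below is the conventional default from the RRF literature, not necessarily what Octopus (or Qdrant) uses internally:

```typescript
// Reciprocal Rank Fusion: merge ranked result lists (e.g. dense and sparse
// search) into one. A document's fused score is the sum of 1/(k + rank)
// over every list it appears in; k damps the advantage of a single #1 hit,
// so documents that rank decently everywhere beat one-list wonders.
function rrf(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1)); // rank is 1-based
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```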
Step 4: The LLM Review
This is the core step. We send the diff, the retrieved code context, the repo's file tree, and the organization's knowledge base to the LLM.
Octopus supports multiple AI providers:
- Anthropic Claude (default, Claude Sonnet 4)
- OpenAI (GPT-4o, o1, o3)
- Google Gemini
Organizations can configure which model to use at the org or repo level. We also support bringing your own API keys, so if you want to use your own Anthropic account, you can.
One optimization worth mentioning: prompt caching. Claude supports caching the system prompt across requests. Since our system prompt (which includes the codebase context) is large but changes infrequently per repo, caching saves a lot of tokens on repeated reviews.
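Concretely, Anthropic's prompt caching works by marking system blocks with `cache_control`. A sketch of how such a request body is shaped (field names follow Anthropic's public Messages API; the prompt text and model pin here are illustrative):

```typescript
// Build a Messages API request where the large, per-repo system block is
// marked ephemeral. Subsequent reviews of the same repo read that block from
// cache instead of paying full input-token price for it every time.
function buildReviewRequest(repoContext: string, diff: string) {
  return {
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    system: [
      {
        type: "text",
        text: `You are a code reviewer.\n\nCodebase context:\n${repoContext}`,
        cache_control: { type: "ephemeral" }, // cache boundary: everything up to here
      },
    ],
    // Only the per-PR part (the diff) changes between requests.
    messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
  };
}
```

The key design point: put the stable, bulky content (codebase context, instructions) *before* the cache marker and the per-request content (the diff) after it, so the cache actually hits.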
Step 5: Parsing and Filtering Findings
The LLM returns findings with severity levels:
- 🔴 Critical: Security vulnerabilities, data loss risks
- 🟠 High: Bugs that will cause issues in production
- 🟡 Medium: Code quality issues, potential problems
- 🔵 Low: Minor improvements, style suggestions
- 💡 Info: Educational notes, best practices
But raw LLM output needs filtering. We apply several layers:
- Confidence threshold: only keep HIGH and MEDIUM confidence findings
- Disabled categories: respect org/repo preferences
- Semantic feedback matching: if a user previously dismissed a similar finding (👎), suppress it. This uses embedding similarity with a 0.85 threshold
- Optional two-pass validation: a second LLM call confirms each finding against the actual diff
- Cap at 30 findings, prioritized by severity
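The deterministic parts of that filter chain can be sketched as a pure function (type and field names are mine; the semantic 👎-matching and two-pass validation are omitted because both need embedding/LLM calls):

```typescript
type Severity = "critical" | "high" | "medium" | "low" | "info";
type Confidence = "high" | "medium" | "low";

interface Finding {
  severity: Severity;
  confidence: Confidence;
  category: string;
  message: string;
}

// Lower rank = more severe, so an ascending sort puts critical first.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0, high: 1, medium: 2, low: 3, info: 4,
};

function filterFindings(
  findings: Finding[],
  disabledCategories: Set<string>,
  maxFindings = 30,
): Finding[] {
  return findings
    .filter((f) => f.confidence !== "low")                // confidence threshold
    .filter((f) => !disabledCategories.has(f.category))   // org/repo preferences
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity])
    .slice(0, maxFindings);                               // cap, most severe first
}
```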
The feedback learning loop is one of my favorite features. When developers react to findings with 👍 or 👎, Octopus learns. Over time, false positives decrease because the system remembers what your team considers noise.
Step 6: Posting Results
Findings are posted as inline PR comments using GitHub's Review API. Each comment appears exactly on the relevant line, just like a human review. We also post a summary table at the top of the review.
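For context, GitHub's endpoint for this is `POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews`, which takes one review body plus an array of positioned comments. A sketch of the payload (the actual HTTP call, via Octokit or `fetch`, is omitted):

```typescript
// One inline comment, anchored to a line on the new side of the diff.
interface InlineComment {
  path: string;  // file the comment attaches to, relative to repo root
  line: number;  // line number in the file's new version
  side: "RIGHT"; // comment on the added/changed side of the diff
  body: string;  // markdown comment text
}

// Assemble the review payload: summary table in `body`, findings as
// positioned `comments`, all submitted as a single review.
function buildReviewPayload(summary: string, comments: InlineComment[]) {
  return {
    event: "COMMENT" as const, // post without approving or requesting changes
    body: summary,
    comments,
  };
}
```

Posting everything as one review (rather than individual comment API calls) means the developer gets a single notification instead of thirty.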
For Bitbucket, the same pipeline works through a provider abstraction layer. One codebase, two platforms.
The Tech Stack
Here's what powers Octopus:
| Layer | Technology |
|---|---|
| Framework | Next.js 16 (App Router) |
| Language | TypeScript (strict mode) |
| Runtime | Node.js + Bun |
| Database | PostgreSQL + Prisma ORM |
| Vector DB | Qdrant (self-hosted) |
| Auth | Better Auth (Google, GitHub OAuth, magic link) |
| AI | Anthropic Claude, OpenAI, Google Gemini |
| Reranking | Cohere rerank-v3.5 |
| Real-time | Pubby (WebSocket) |
| Payments | Stripe |
| UI | shadcn/ui + Radix + Tailwind CSS 4 |
| Monorepo | Turborepo |
Why This Stack?
Next.js 16 with App Router gave me server components, server actions, and API routes in one place. For a tool that needs both a dashboard UI and webhook endpoints, this was perfect.
Qdrant over Pinecone because I wanted self-hosting capability. Enterprise customers with strict security requirements need their code to stay on their infrastructure. Qdrant's hybrid search (dense + sparse vectors) was also a differentiator.
Better Auth over NextAuth because it has first-class Prisma support and a plugin system that made adding features like magic links straightforward.
Pubby for real-time because reviews aren't instant. They take 15-60 seconds. Users need to see progress. We push real-time status updates (fetching-diff → searching-context → generating-review → posting-comment) through organization-scoped WebSocket channels so the dashboard shows live progress.
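The status events themselves are small. A sketch of what gets pushed per phase (Pubby's actual publish API isn't shown in the post, so this only models the event shape and the org-scoped channel naming):

```typescript
// The pipeline's status phases, published in order as the review progresses.
type ReviewStatus =
  | "fetching-diff"
  | "searching-context"
  | "generating-review"
  | "posting-comment";

// Build the event for one phase transition. Scoping the channel to the
// organization means every dashboard session in that org sees the update,
// and no one outside it does.
function statusEvent(orgId: string, reviewId: string, status: ReviewStatus) {
  return {
    channel: `org:${orgId}`,
    event: "review-status",
    payload: { reviewId, status, at: new Date().toISOString() },
  };
}
```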
Hard Problems I Had to Solve
The Context Window Problem
An LLM can't review code well if it only sees the diff. It needs to understand how the changed code relates to the rest of the codebase. But you can't stuff an entire repository into a context window.
The solution was the hybrid search + reranking pipeline I described above. Instead of sending everything, we send the most relevant parts of the codebase. The AI gets enough context to understand the changes without hitting token limits.
False Positive Management
Early versions of Octopus were noisy. The AI would flag things that weren't actually problems, and developers would start ignoring all findings. This is the worst possible outcome for a review tool.
The feedback loop solved this. But I also added:
- Semantic deduplication: don't post the same finding twice on re-reviews
- Re-review mode: on follow-up reviews, only show critical findings
- Configurable severity thresholds: teams can choose which severity levels get inline comments vs. go into the summary
Cost Management
LLM calls are expensive. A single review can cost anywhere from $0.05 to $0.50 depending on the diff size and model. At scale, this adds up fast.
I built a cost tracking system that logs every token (input, output, cached reads, cached writes) per organization, repository, and operation type. Organizations have configurable monthly spend limits (default $150/month). Before every expensive operation, the system checks isOrgOverSpendLimit().
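A minimal sketch of that accounting (the post names isOrgOverSpendLimit() but not its internals, so the structure and the per-million-token prices below are illustrative; real prices vary by model and provider):

```typescript
// Token counts for one LLM operation, broken down the way providers bill them.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;  // cheap: cached prompt reads
  cacheWriteTokens: number; // slight premium over plain input
}

// Illustrative USD prices per million tokens.
const PRICE_PER_M = { input: 3, output: 15, cacheRead: 0.3, cacheWrite: 3.75 };

function usageCostUsd(u: TokenUsage): number {
  return (
    (u.inputTokens * PRICE_PER_M.input +
      u.outputTokens * PRICE_PER_M.output +
      u.cacheReadTokens * PRICE_PER_M.cacheRead +
      u.cacheWriteTokens * PRICE_PER_M.cacheWrite) / 1_000_000
  );
}

// Gate expensive operations on the org's accumulated month-to-date spend.
function isOrgOverSpendLimit(monthToDateUsd: number, limitUsd = 150): boolean {
  return monthToDateUsd >= limitUsd;
}
```

Note the 10x gap between cache reads and regular input in this price sketch; that asymmetry is why the prompt-caching optimization pays off.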
Prompt caching on Claude helped a lot here. The system prompt with codebase context gets cached, so repeated reviews on the same repo reuse the cached prompt and only pay for the diff portion.
Build Artifact Detection
You'd be surprised how many PRs accidentally include node_modules/, dist/ folders, or package-lock.json changes. Octopus automatically detects these with regex patterns and flags them as critical findings before running the full review. Simple but saves teams a lot of headache.
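The detection can be as simple as matching the PR's changed-file paths against a pattern list (these patterns are my guess at the rules; the post doesn't list the actual regexes):

```typescript
// Paths that usually mean a build artifact was committed by accident.
// Lockfile changes are matched too so a large lockfile diff gets flagged
// for a human look, per the post.
const ARTIFACT_PATTERNS: RegExp[] = [
  /(^|\/)node_modules\//,
  /(^|\/)dist\//,
  /(^|\/)build\//,
  /\.min\.(js|css)$/,
  /(^|\/)package-lock\.json$/,
];

function findArtifacts(changedFiles: string[]): string[] {
  return changedFiles.filter((f) => ARTIFACT_PATTERNS.some((p) => p.test(f)));
}
```

Running this cheap check before the LLM review also avoids wasting tokens on a 40,000-line vendored-dependency diff.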
What I Learned
1. Retrieval Quality > Model Quality
Switching from GPT-4 to Claude improved review quality. But adding Cohere reranking on top of vector search improved it even more. The context you feed the LLM matters at least as much as which LLM you pick, probably more.
2. Feedback Loops Are Everything
The 👍/👎 system seemed like a nice-to-have when I built it. It turned out to be critical for long-term adoption. Without it, false positives pile up and people start ignoring the tool entirely. With it, the system gradually learns what each team actually cares about.
3. Show Progress, Not Silence
The first version had no progress indicators. Users would trigger a review and stare at a blank screen wondering if it was broken. Adding WebSocket-based status updates (fetching diff... searching context... generating review...) was a small change that made a huge difference in how the tool felt to use.
4. Abstract Early When You Know You'll Need It
Supporting both GitHub and Bitbucket, and multiple AI providers, forced clean abstractions early. It was more work upfront but made adding new providers straightforward later.
What's Next
I'm working on deeper IDE integrations, better conflict detection for monorepos, and expanding the knowledge base so teams can teach Octopus their domain-specific rules.
Octopus is open source and available on GitHub. It integrates with GitHub and Bitbucket. You can also self-host it if your code can't leave your infrastructure. There's a free tier for smaller teams.
I'd love to hear how other people are approaching AI-assisted code review. Feel free to drop a comment or reach out on X.