Before we try to predict the AI-driven future, let’s first look at the present. There are three motivations for consistently doing code reviews in a team:
- Reduce cost by removing faulty code before deploying to production.
If it takes a human 20 minutes to prevent hours of downtime and considerable losses in revenue, that’s time well spent.
- Prevent growth of technical debt.
If a human can spot architectural choices which will lead to worse technical outcomes in the future, the benefit of finding them in a review is high.
- Build a shared understanding of the code base.
Teams work best when they have all built a cognitive model of both an application’s source code and its runtime behaviors: how requests are processed, how data flows through it, how errors are handled, and so on.
If an AI helps in any of those dimensions, it will free humans up to spend more time with the others and will likely get adopted.
The biggest challenge with human code reviews
Of the many issues teams face using code reviews as “quality gates,” one stands out far above the rest.
When faced with a large code change (for example, a Pull Request with hundreds of files changed), a reviewer is less likely to review it thoroughly. The quality and engagement level of human code reviews is sadly inversely proportional to the size of the code change.
Many developers will recognize this phenomenon. Submitting a 7-line code change could invite intense scrutiny and debate from the reviewer, yet the same colleague might wave a substantial change through with a simple “LGTM” because they don’t have the time to inspect it thoroughly, and the organizational pressure to ship can be too much to resist.
High-performing teams currently try to avoid this problem by shipping small batches of changes, but that’s not always possible. Any advancement in developer AI that allows larger code changes to get the same kind of in-depth review as small ones will be enormously popular.
Determinism is desirable
There’s a pattern we can spot in existing developer tools:
- Test frameworks run banks of tests and report on their success
- Linters check code style and consistency of source layout
- Coverage checkers measure how well the code base is tested
- Dependency validators spot 3rd party libraries that have known vulnerabilities
These commodity tools are used by the vast majority of development teams. What they have in common is their determinism: they produce reliable reports that a reviewer can use to judge the quality of the code changes.
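A minimal sketch of how these deterministic checks are typically wired together as a pre-merge gate. The specific commands (pytest, ruff, coverage, pip-audit) are examples, not a prescription; substitute whatever your team already runs:

```python
import subprocess

# Hypothetical pre-merge gate covering the four deterministic tool
# categories above: tests, lint, coverage, dependency validation.
CHECKS = [
    ("tests", ["pytest", "-q"]),
    ("style", ["ruff", "check", "."]),
    ("coverage", ["coverage", "report", "--fail-under=80"]),
    ("dependencies", ["pip-audit"]),
]

def run_checks(checks, runner=subprocess.run):
    """Run each check command; return the names of the checks that failed."""
    failures = []
    for name, cmd in checks:
        if runner(cmd).returncode != 0:
            failures.append(name)
    return failures

# In CI, the gate would simply be: exit non-zero if run_checks(CHECKS)
# returns any failures.
```

Because every check is deterministic, the gate’s verdict is reproducible: the same code change yields the same pass/fail report on every run.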
Even with their deterministic behavior, these tools still leave a great deal of probabilistic work on the desk of the code reviewer. For example, a dependency validator can deterministically report that an embedded 3rd party library contains a vulnerability, but a human still has to determine whether that library is used in a vulnerable way.
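To make the deterministic half concrete, here is a toy dependency validator: same input, same report, every time. The advisory table is a stand-in for a real vulnerability database (the CVE identifiers shown are real advisories, used here only as examples):

```python
# Toy vulnerability database: (package, version) -> advisory id.
# A real validator would consult a curated feed instead.
KNOWN_VULNERABLE = {
    ("requests", "2.19.0"): "CVE-2018-18074",
    ("pyyaml", "5.3"): "CVE-2020-14343",
}

def audit(dependencies):
    """Map {package: version} to a sorted, reproducible list of findings."""
    findings = []
    for name, version in sorted(dependencies.items()):
        advisory = KNOWN_VULNERABLE.get((name, version))
        if advisory:
            findings.append(f"{name}=={version}: {advisory}")
    return findings

print(audit({"requests": "2.19.0", "flask": "2.0.1"}))
# → ['requests==2.19.0: CVE-2018-18074']
```

The probabilistic remainder, deciding whether the vulnerable code path is actually reachable from the application, is exactly the work that stays with the human reviewer.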
AI for determinism, humans for probabilism
As the use of AI in code review progresses, its ability to deterministically reveal code flaws will increase.
Put another way, the last thing code reviewers want is a tool that reports even more false positives and “fuzzy maybes”, or causes them to waste time digging into AI-reported issues that are not concerns at all.
Developers are busy, impatient, precise, and technically demanding and will not tolerate or adopt tools that create more work for them. There’s a reason why IAST and DAST tools have been unable to “shift left” as they promised: apart from taking far too long to run, they report too many false positives.
If AI is to succeed within the world of code reviews, it must do more things deterministically, leaving the human to apply their brainpower only to the most difficult, creative, and subjective problems.
New determinism needs new data
Countless startups have recently appeared, all excitedly offering “AI-assisted code reviews”. Most simply send code changes to an AI based on a Large Language Model (LLM), such as OpenAI’s GPT. The AI summarizes the changes, and the human code reviewer gets something like this:
> This change adds an index to the `Users` table, as well as a new `text` column in the same table named …
Is it helpful for code reviewers to see such a concise summary? Assuming it’s accurate, of course it is! Any worthwhile code review starts with understanding its intent, scope, and purpose.
Unfortunately, these technologies only have the “source diff” of the Pull Request or branch to work from. Modern AI tools are inherently limited to their inputs and the models upon which they are based, and these products do not expand either of them. It is therefore no surprise that GitHub announced Copilot X to offer AI-assisted code reviews: “gluing” a source diff to an LLM to produce PR summaries is about to become a commodity offering.
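The “glue” in question is genuinely thin. A sketch of the pattern, where `call_llm`, `build_prompt`, and the prompt wording are all hypothetical placeholders for whichever chat client and prompt a given product uses:

```python
# The commodity pattern: the only input the AI ever sees is the diff.
def build_prompt(diff: str) -> str:
    # Prompt wording is invented for illustration.
    return "Summarize this code change for a reviewer:\n\n" + diff

def summarize_pr(diff: str, call_llm) -> str:
    # call_llm is a placeholder for any LLM chat-completion client.
    return call_llm(build_prompt(diff))

diff = """--- a/db/schema.sql
+++ b/db/schema.sql
+CREATE INDEX idx_users_email ON Users (email);
"""
# With a real client, summarize_pr(diff, call_llm) returns a short
# natural-language summary of the diff, and nothing more: the model
# has no knowledge of how the changed code behaves at runtime.
```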
Runtime data is the next AI frontier
We are quickly reaching the ceiling of what an AI can do when examining code change listings, so new data are needed for a rapid AI expansion to occur within the domain of code reviews.
What is missing is runtime data. An AI’s ability to reason about the impact of code changes will radically expand with information about how an application behaves when it is executed. An AI using runtime data will make enormous strides, offering code reviewers deterministic reports on performance, security, reliability, architecture, and maintainability issues.
Where should this runtime data come from? It can’t come from production systems, because a developer would have to deploy the code first, which defeats the point of code reviews: preventing production problems in the first place.
Therefore, dynamic data should come from running the application before it gets deployed, either on the developer’s machine or in a central build environment. This is the approach we take at AppMap: we build recordings of runtime behavior and turn them into deterministic reports for code reviewers that show behavioral differences and flaws.
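As a toy illustration of what “runtime data” can mean, here is a generic sketch of recording function calls during a pre-deployment test run. The record shape is invented for this example; it is not AppMap’s actual format:

```python
import functools

# Toy runtime recorder: wrapped functions leave behind a trace that a
# review tool could diff across branches.
TRACE = []

def record(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        event = {"call": fn.__qualname__, "args": repr(args)}
        TRACE.append(event)
        result = fn(*args, **kwargs)
        event["returned"] = repr(result)
        return result
    return wrapper

@record
def find_user(user_id):
    # In a real app this would run a SQL query; its text could be
    # captured in the trace as well.
    return {"id": user_id, "name": "alice"}

find_user(7)
# TRACE now holds one event recording the call, its arguments, and
# its return value.
```

Diffing two such traces, one from the base branch and one from the PR branch, is what turns runtime behavior into a deterministic artifact a reviewer can act on.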
How will AI-driven code reviews look?
If we imagine a world where AI is doing what AI does best, and humans are doing what humans do best, code reviews will look very different than they do today:
- AIs will catch tremendous numbers of potential performance and stability issues that could have impacted production systems - before they even get reviewed by a human.
- Rates of introduction of new security flaws will drop as developers quickly and deterministically spot insecure code changes.
- In-depth human reviews of larger code changes will become possible and be less susceptible to the “LGTM effect”.
- Developers will more quickly onboard into new teams, as AIs will act like a dedicated mentor to help them build a shared understanding of their system’s behaviors as each change is reviewed and merged.
- Technical debt reduction guidelines and architectural directions will progressively become “encoded” in a team’s AI, reducing the time needed to enforce them.
- Cycle times and change failure rates will fall, while overall change flow rates should increase.
The recent advancements in AI are unlike anything else in recent software history. Applying AI to the challenge of code reviews will bring enormous benefits to the tech industry and become a virtuous cycle of quality that benefits everyone.