This is the first in a series of blog posts describing how to use AI to illuminate codebases for engineering managers. Managers need to know their codebases, but they rarely have the time or resources to audit them constantly, or a granular enough understanding to know where to look. The semi-structured nature of code lends itself nicely to evaluation by LLMs, particularly for the kind of code understanding that matters to managers. LLMs have been trained on massive amounts of code, which makes them familiar with an array of programming languages, syntaxes, and common coding patterns. Additionally, the generative capacity of LLMs provides very human-understandable answers. On the surface, this is a perfect use case for AI.
However, engineering managers need consistent and reliable results, generally more reliable than LLMs supply by default. A manager needs to know what’s in the code without taking the time to look through it in great detail herself, and certainly without continuously bugging her engineers. Day-to-day, engineering managers need to ensure the code is high quality, compliant with corporate policies, and not exposing the organization to compliance and privacy risks. Thus, they need both precision (i.e., whatever the report flags needs to be correct) and recall (i.e., no important issues are missed). Missing a potential problem exposes them to risk, and guarding against that means asking an engineer to comb over all of the code or falling back on a different method, which negates the productivity gains of using an LLM. Too many false positives bury the signal in the noise and send engineers chasing non-existent problems. Too many of these hallucinated problems cause alarm fatigue.
Many use cases necessitate a very high degree of certainty that nothing has been missed, e.g., an open source library being used in violation of its licensing terms, or very poor quality code being submitted by contractors. When engineers are concerned that the LLM will miss important information, they are forced to review the code with other methods, often manually. This is especially true for audits, such as SOC 2 or FedRAMP certifications. On the other hand, hallucinations and other LLM issues waste engineers' time on wild goose chases.
For very simple questions about code, RAG using a simple similarity search might be sufficient. We have a natural language query, e.g., a question about the user interface. The similarity search finds chunks of code and documentation that contain tokens semantically similar to “user interface,” such as those related to CSS, layout, and other frontend components. Some of these chunks will likely contain the required information (and there will certainly be noisy chunks as well), which the LLM will consume and use to answer the prompted question.
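To make the retrieval step concrete, here is a minimal sketch of a top-k similarity search. It is a toy under stated assumptions: a bag-of-words cosine similarity stands in for a real embedding model, and the chunk names, chunk contents, and the `retrieve` helper are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the names of the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), name) for name, text in chunks.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no tokens with the query at all.
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical chunks standing in for pieces of a repo.
CHUNKS = {
    "styles.css": ".button { color: blue; layout: flex; } /* user interface styles */",
    "layout.tsx": "export function Layout() { return <div className='layout'>user interface</div>; }",
    "db.py": "def connect(): return psycopg.connect(DSN)",
}

top_chunks = retrieve("user interface layout", CHUNKS, k=2)
```

The chunks in `top_chunks` are what would be passed to the LLM along with the question. A real system would replace the token counts with embeddings, but the top-k cutoff works the same way.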
While simple RAG can sometimes work, it can also fail on simple questions for a variety of reasons. It has parameters that need to be tuned, such as the embedding algorithm, the similarity search metric, and the chunk size. Semantically matching short natural language query strings against code can be complicated. Even the order of the retrieved documents matters: ordering them intelligently, rather than in whatever order they spill out of the similarity search, can improve the result. Still, this simple method can be somewhat effective, and it is sometimes sufficient for very simple queries.
Occasionally, an even less sophisticated approach is possible: omitting semantic search altogether. Rather than filtering down to the relevant code chunks, the entire code repo can be shoved into the context window. Obviously, these huge queries have costs in terms of money and latency. More importantly, results degrade sharply as the context grows, because irrelevant content drowns out the signal. Like humans, LLMs perform best when they are given just the relevant context to answer a question.
For both of these simplistic approaches, recall can be quite poor. For simple RAG, the number of code chunks and their size may be insufficient to capture all of the information needed to answer the question. For example, imagine a user striving to ensure that the open source libraries in their code are used in compliance with their licenses. Similarity search returns a finite set of documents, typically either a constant number or those above a certain similarity threshold. If the number of retrieved documents is too low, which is likely in a repo of any significant size, some dependencies will be excluded. In the single-context approach, the context window will have all the code, but the LLM is likely to be confused by the huge amount of irrelevant information, and output token limits may prevent it from reporting all of the offending libraries. In either case, there is a large risk of missing a library that causes legal trouble.
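The cap that a fixed retrieval size places on recall can be shown with a few lines of arithmetic. This is an illustrative sketch with made-up numbers: the chunk names and the `recall_at_k` helper are hypothetical, and the ranking is assumed to be perfect, which is the best case for similarity search.

```python
def recall_at_k(ranked_chunks, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    retrieved = set(ranked_chunks[:k])
    return len(retrieved & relevant) / len(relevant)


# Hypothetical repo: 10 chunks declare dependencies, 50 chunks are unrelated.
relevant = {f"dep_{i}" for i in range(10)}
# Best case: every dependency chunk is ranked ahead of every unrelated chunk.
ranked = [f"dep_{i}" for i in range(10)] + [f"noise_{i}" for i in range(50)]

recall_small_k = recall_at_k(ranked, relevant, k=4)   # 4 of 10 found
recall_large_k = recall_at_k(ranked, relevant, k=10)  # all 10 found
```

Even with a perfect ranking, retrieving 4 chunks finds at most 4 of the 10 dependency declarations. In practice the ranking is noisy, so the achieved recall is lower still, and the right value of k is not known in advance.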
These naive approaches may be the best available option when it’s not possible to know which context is relevant. A codebase, however, is more structured, so our approaches at Flux can be much more precise. For engineering managers, this is vital. The kinds of superficial questions that can be answered with the simplistic methods are not interesting or robust enough to be trusted and useful. For a given prompt, Flux leverages domain knowledge to choose the relevant code, avoiding both the bluntness of overloading the context window and the mismatches of semantic search. For the questions we use for evaluation, we know which code is relevant, and we can properly aggregate and package it for the LLM. Thus, we are able to retrieve exactly the proper information while excluding irrelevant noise.
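As one illustration of what domain knowledge can buy for the license question, dependencies are usually declared in a small set of well-known manifest files, so they can be located by name rather than by similarity. The sketch below is not Flux's implementation; the manifest list and the `find_manifests` helper are assumptions chosen for the example, and a real list would be longer.

```python
from pathlib import Path

# Assumed, non-exhaustive list of dependency manifest file names.
MANIFEST_NAMES = {"requirements.txt", "package.json", "go.mod", "Cargo.toml", "pom.xml"}


def find_manifests(root):
    """Return repo-relative paths of every known manifest file under root."""
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # Skip version-control internals.
        if ".git" in rel.parts:
            continue
        if path.is_file() and path.name in MANIFEST_NAMES:
            found.append(rel.as_posix())
    return sorted(found)
```

Calling `find_manifests("path/to/repo")` returns every matching file regardless of repo size, so nothing is dropped by a top-k cutoff, and only those files need to be packaged for the LLM.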
Proper evaluation involves using the right tool for the job. For example, some repo evaluation is best done by static analysis tools. Often, properly evaluating a repo involves evaluating smaller, semantically meaningful segments and then aggregating the results intelligently. Many times, it involves creating an ensemble of approaches. In the next blog post, I’ll explain how to apply this domain knowledge in a compound approach. In addition to improving reliability, this greatly expands both the depth and breadth of the evaluation questions, and it removes the limit on the size of repo that can be assessed.
Rachel Lomasky is the Chief Data Scientist at Flux, where she continuously identifies and operationalizes AI so Flux users can understand their codebases. In addition to a PhD in Computer Science, Rachel applies her 15+ years of professional experience to augment generative AI with classic machine learning. She regularly organizes and speaks at AI conferences internationally. Keep up with her via LinkedIn.