Unlocking the power of unstructured data with RAG
Unstructured data holds valuable information about codebases, organizational best practices, and customer feedback. Here are some ways you can leverage it with RAG, or retrieval-augmented generation.
Whether they’re building a new product or improving a process or feature, developers and IT leaders need data and insights to make informed decisions.
When it comes to software development, this data comes in two forms: unstructured and structured. While structured data follows a specific and predefined format, unstructured data (like an email, an audio or visual file, a code comment, or a commit message) doesn’t. This makes unstructured data hard to organize and interpret, which means teams can miss out on potentially valuable insights.
To make the most of their unstructured data, development teams are turning to retrieval-augmented generation, or RAG, a method for customizing large language models (LLMs). They can use RAG to keep LLMs up to date with organizational knowledge and the latest information available on the web. They can also use RAG and LLMs to surface and extract insights from unstructured data.
GitHub data scientists Pam Moriarty and Jessica Guo explain unstructured data’s unique value in software development and how developers and organizations can use RAG to create greater efficiency and value in the development process.
Unstructured data in software development
When it comes to software development, unstructured data includes source code and the context surrounding it, as these sources of information don’t follow a predefined format.
Here are some examples of unstructured data on GitHub:
- README files describe in text the purpose behind project source code, and include instructions for source code use, how to contribute, and other details that developers decide are important to include. While they’re usually written in Markdown, README files don’t follow a predefined structure.
- Code files are more orderly than README files in that they follow the syntax of a programming language. But not all code files have the exact same fields nor are they all written in the same format. Additionally, some parts of the file, like coding logic and variable names, are decided by individual developers.
- Package documentation explains how the software works and how to use it. Documentation, written in natural language, can include installation instructions, troubleshooting tips, a description of the package’s API, and a list of any dependencies required to use the package. It can also include code snippets that highlight the package’s features.
- Code comments explain the function behind certain code blocks in a code file. They’re text comments written in natural language, and they make the source code easier for other developers to understand.
- Wiki pages, while not limited to unstructured data, can contain helpful text documentation about installation instructions, API references, and other information.
- Commit messages describe in natural language text the changes a developer made to a codebase and why.
- Issue and pull request descriptions are written in natural language and in a text field. They can contain any kind of information a developer chooses to include about a bug, feature request, or general task in a project.
- Discussions contain a wealth and variety of information, from developer and end-user feedback to open-ended conversations about a topic. As long as a repository enables discussions, anyone with a GitHub account can start a discussion.
- Review comments are where developers can discuss changes before they’re merged into a codebase. Consequently, they contain information in natural language about code quality, context behind certain decisions, and concerns about potential bugs.
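To make one item from this list concrete, here’s a minimal sketch of pulling the free-form comments out of a code file so they can be treated as text. It assumes Python source and uses the standard library’s `tokenize` module. The `extract_comments` helper and the `SAMPLE` snippet are hypothetical names made up for illustration, not part of any GitHub tooling.

```python
import io
import tokenize


def extract_comments(source: str) -> list[str]:
    """Return the text of every `#` comment in a Python source string."""
    comments = []
    # tokenize follows Python's syntax rules, so a '#' inside a string
    # literal is not mistaken for a comment.
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type == tokenize.COMMENT:
            comments.append(token.string.lstrip("#").strip())
    return comments


# Hypothetical code file: the syntax is structured, the comments are free-form.
SAMPLE = '''
# Retry because the upstream service is flaky on cold start.
def fetch(url):
    label = "#not-a-comment"
    return url  # TODO: add a timeout
'''

if __name__ == "__main__":
    for comment in extract_comments(SAMPLE):
        print(comment)
```

The syntax around the comments is predictable enough to parse, but what each comment says is entirely up to the developer who wrote it.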
The value of unstructured data
The same features that make unstructured data valuable also make it hard to analyze.
Unstructured data lacks inherent organization, as it often consists of free-form text, images, or multimedia content.
“Without clear boundaries or predefined formats, extracting meaningful information from unstructured data becomes very challenging,” Guo says.
But LLMs can help to identify complex patterns in unstructured data—especially text. Though not all unstructured data is text, a lot of text is unstructured. And LLMs can help you to analyze it.
“When dealing with ambiguous, semi-structured or unstructured data, LLMs dramatically excel at identifying patterns, sentiments, entities, and topics within text data and uncover valuable insights that might otherwise remain hidden,” Guo explains.
Need a refresher on LLMs? Check out our AI explainers, guides, and best practices.
Here are a few reasons why developers and IT leaders might consider using RAG-powered LLMs to leverage unstructured data:
- Surface organizational best practices and establish consistency. Through RAG, an LLM can receive a prompt with additional context pulled from an organization’s repositories and documents. So, instead of sifting through documents and piecing answers together, developers can quickly receive answers from an LLM that align with their organization’s knowledge and best practices.
- Accelerate and deepen understanding of an existing codebase, including its conventions, functions, common issues, and bugs. Understanding and familiarizing yourself with code written by another developer is a persistent challenge for several reasons, including but not limited to: code complexity, use of different coding styles, a lack of documentation, use of legacy code or deprecated libraries and APIs, and the buildup of technical debt from quick fixes and workarounds.
RAG can help to alleviate these pain points by enabling developers to ask questions and receive answers in natural language about a specific codebase. It can also guide developers to relevant documentation or existing solutions.
Accelerated and deepened understanding of a codebase enables junior developers to contribute their first pull request with less onboarding time and senior developers to mitigate live site incidents, even when they’re unfamiliar with the service that’s failing. It also means that legacy code suffering from “code rot” and natural aging can be more quickly modernized and easily maintained.
Unstructured data doesn’t just help to improve development processes. It can also improve product decisions by surfacing user pain points.
Moriarty says, “Structured data might show a user’s decision to upgrade or renew a subscription, or how frequently they use a product or not. While those decisions represent the user’s attitude and feelings toward the product, it’s not a complete representation. Unstructured data allows for more nuanced and qualitative feedback, making for a more complete picture.”
A lot of information and feedback is shared during informal discussions, whether those discussions happen on a call, over email, on social platforms, or in an instant message. From these discussions, decision makers and builders can find helpful feedback to improve a service or product, and understand general public and user sentiment.
What about structured data?
Unlike unstructured data, structured data (like relational databases, Protobuf files, and configuration files) follows a specific and predefined format.
We’re not saying unstructured data is more valuable than structured. But the processes for analyzing structured data are more straightforward: you can use SQL functions to modify the data and traditional statistical methods to understand the relationship between different variables.
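As a rough illustration of why structured analysis is more straightforward, here’s a minimal sketch that aggregates rows with a SQL function and then measures the relationship between two variables. The subscription records, the `subscriptions` table, and the column names are all invented for this example, and the Pearson correlation is just one of many traditional statistical methods you could pick.

```python
import sqlite3
from math import sqrt

# Hypothetical records: (tier, weekly_sessions, renewed). Every row has the
# same typed columns, which is what makes the data structured.
ROWS = [
    ("free", 2, 0),
    ("free", 5, 0),
    ("pro", 12, 1),
    ("pro", 20, 1),
    ("pro", 9, 0),
]


def average_sessions_by_tier(rows):
    """Aggregate weekly sessions per tier with a SQL function (AVG)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions (tier TEXT, weekly_sessions INTEGER, renewed INTEGER)"
    )
    conn.executemany("INSERT INTO subscriptions VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT tier, AVG(weekly_sessions) FROM subscriptions GROUP BY tier ORDER BY tier"
    ).fetchall()
    conn.close()
    return dict(result)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print(average_sessions_by_tier(ROWS))
    sessions = [row[1] for row in ROWS]
    renewed = [row[2] for row in ROWS]
    print(pearson(sessions, renewed))
```

Nothing here needs a language model: the schema already tells you what each value means, so a query and a formula are enough.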
That’s not to say AI isn’t used for structured data analysis. “There’s a reason that machine learning, given its predictive power, is and continues to be widespread across industries that use data,” according to Moriarty.
However, “Structured data is often numeric, and numbers are simply easier to analyze for patterns than words are,” Moriarty says. Not to mention that methods for analyzing structured data have been around longer than those for analyzing unstructured data: “A longer history with more focus just means there are more established approaches, and more people are familiar with it,” she explains.
That’s why the demand to enhance structured data might seem less urgent, according to Guo. “The potential for transformative impact is significantly greater when applied to unstructured data,” she says.
How does RAG extract value from unstructured data?
With RAG, an LLM can use data sources beyond its training data to generate an output.
RAG is a prompting method that uses retrieval—a process for searching for and accessing information—to add more context to a prompt that generates an LLM response.
This method is designed to improve the quality and relevance of an LLM’s outputs. Additional data sources can include a vector database, a traditional database, or a search engine. So, developers who use an enterprise AI tool equipped with RAG can receive AI outputs customized to their organization’s best practices, knowledge, and proprietary data.
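The retrieve-then-prompt flow described above can be sketched in a few lines. This is a toy under stated assumptions, not how GitHub Copilot is implemented: the `DOCUMENTS` snippets are invented, `retrieve` ranks by simple word overlap instead of a real search backend, and the prompt wording in `build_prompt` is arbitrary. The final step, sending the prompt to an LLM, is deliberately left out.

```python
import re

# Hypothetical snippets standing in for an organization's indexed documents.
DOCUMENTS = {
    "CONTRIBUTING.md": "Every pull request needs one approving review and passing CI before merge.",
    "docs/logging.md": "Use structured JSON logging and never log access tokens.",
    "docs/release.md": "Releases are cut from the main branch every second Tuesday.",
}


def words(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Retrieval step: rank documents by how many words they share with the question."""
    question_words = words(question)
    scored = [(len(question_words & words(body)), name) for name, body in documents.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def build_prompt(question, documents, top_k=2):
    """Augmentation step: add the retrieved text to the prompt as context."""
    names = retrieve(question, documents, top_k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in names)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("What logging format should I use?", DOCUMENTS))
```

The model never sees the whole document set, only the pieces that retrieval judged relevant, which is how the answer ends up reflecting organizational knowledge that was never in the training data.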
We break down these data sources in our RAG explainer, but here’s a quick summary:
- Vector databases. While you code in your IDE, algorithms create embeddings for your code snippets, which are stored in a vector database. An AI coding tool can search that database to find snippets from across your codebase that are similar to the code you’re currently writing and generate a suggestion.
And when you’re engaging with GitHub Copilot Chat on GitHub.com or in the IDE, your query or code is transformed into an embedding. Our retrieval service then fetches relevant embeddings from the vector database for the repository you’ve indexed. These embeddings are turned back into text and code when they’re added to the prompt as additional context for the LLM. This entire process leverages unstructured data, even though the retrieval system uses embeddings internally.
- General text search. When developers engage with GitHub Copilot Chat under a GitHub Copilot Enterprise plan, they can index repositories—specifically code and documentation. So, when a developer on GitHub.com or in the IDE asks GitHub Copilot Chat a question about an indexed repository, the AI coding tool can retrieve data from all of those indexed, unstructured data sources. And on GitHub.com, GitHub Copilot Chat can tap into a collection of unstructured data in Markdown files from across repositories, which we call knowledge bases.
Learn about GitHub Copilot Enterprise features.
But wait, why is Markdown considered unstructured data? Though you can use Markdown to format a file, the file itself can contain essentially any kind of data. Think about it this way: how would you put the contents of a Markdown file in a table?
- External or internal search engine. The retrieval method searches and pulls information from a wide range of sources from the public web or your internal platforms and websites. That information is used for RAG, which means the AI model now has data from additional files—like text, image, video, and audio—to answer your questions.
Retrieval also taps into internal search engines. So, if a developer wants to ask a question about a specific repository, they can index the repository and then send their question to GitHub Copilot Chat on GitHub.com. Retrieval uses our internal search engine to find relevant code or text from the indexed files, which are then used by RAG to prompt the LLM for a contextually relevant response.
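To show the embedding-based lookup from the vector database item in miniature, here’s a hedged sketch. Real systems use a learned embedding model and a dedicated vector store. Here `embed` is a toy word-count vector and `VectorIndex` is an in-memory list, both hypothetical stand-ins written for this example. What carries over is the shape of the process: embed the snippets, embed the query, rank by similarity, and return the original text so it can be added to the prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        # Each entry keeps the vector next to the original text, so a hit
        # can be turned back into text for the prompt.
        self._items = []

    def add(self, text):
        self._items.append((embed(text), text))

    def search(self, query, top_k=1):
        query_vector = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:top_k]]


def build_demo_index():
    """Index a few hypothetical code snippets."""
    index = VectorIndex()
    index.add("def parse_config(path): return json.load(open(path))")
    index.add("def send_email(to, subject, body): smtp.send(to, subject, body)")
    index.add("def retry(fn, attempts): call fn until it succeeds")
    return index


if __name__ == "__main__":
    print(build_demo_index().search("load json config from path"))
```

A word-count vector only matches exact words, while a learned embedding can also match snippets that mean something similar but are phrased differently. That gap is the main reason production retrieval relies on trained embedding models.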
Stay smart: LLMs can do things they weren’t trained to do, so it’s important to always evaluate and verify their outputs.
Use RAG to unlock insights from unstructured data
As developers improve their productivity and write more code with AI tools like GitHub Copilot, there’ll be even more unstructured data, not just in the code itself, but also in the information used to build, contextualize, maintain, and improve that code.
That means even more data containing rich insights that organizations can surface and leverage, or let sink and disappear.
Developers and IT leaders can use RAG as a tool to help improve their productivity, produce high-quality and consistent code at greater speed, preserve and share information, and increase their understanding of existing codebases, which can in turn reduce onboarding time.
With a RAG-powered AI tool, developers and IT leaders can quickly discover, analyze, and evaluate a wealth of unstructured data—simply by asking a question.
A RAG reading list 📚
- What is retrieval-augmented generation, and what does it do for generative AI?
- Customizing and fine-tuning LLMs: What you need to know
- How we’re experimenting with LLMs to evolve GitHub Copilot
- How GitHub Copilot is getting better at understanding your code