Applying machine intelligence to GitHub security alerts
Learn how we use machine learning to power and build on security alerts and make GitHub more secure.
Last year, we released security alerts that track security vulnerabilities in Ruby and JavaScript packages. Since then, we’ve identified more than four million of these vulnerabilities and added support for Python. In our launch post, we mentioned that all vulnerabilities with CVE IDs are included in security alerts, but sometimes there are vulnerabilities that are not disclosed in the National Vulnerability Database. Fortunately, our collection of security alerts can be supplemented with vulnerabilities detected from activity within our developer community.
Leveraging the community
There are many places a project can publicize security fixes in a new version: the CVE feed, mailing lists, open source groups, or even its release notes or changelog. Regardless of how projects share this information, some developers in the GitHub community will see the advisory and immediately bump their required version of the dependency to a known safe version. When we detect these commits, we can use the information in them to generate security alerts for vulnerabilities that may not have been published in the CVE feed.
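To make the idea of a version-bump commit concrete, here is a minimal sketch of pulling such a bump out of a unified diff. It is an illustration rather than our production code: it assumes a `requirements.txt`-style dependency file, and the `REQ_LINE` pattern and the `version_bumps` helper are hypothetical names introduced for this example.

```python
import re

# Hypothetical pattern for changed requirements.txt-style lines such as
# "-django>=2.0.7" or "+django>=2.0.8". Only a few operators are handled.
REQ_LINE = re.compile(
    r"^([-+])\s*([A-Za-z0-9_.\-]+)\s*(==|>=|~=)\s*([0-9][0-9A-Za-z.]*)\s*$"
)


def version_bumps(diff_text):
    """Return {package: (old_requirement, new_requirement)} for packages whose
    requirement was removed and re-added with a different version."""
    removed, added = {}, {}
    for line in diff_text.splitlines():
        # Skip the "---" / "+++" file headers of a unified diff.
        if line.startswith(("---", "+++")):
            continue
        match = REQ_LINE.match(line)
        if not match:
            continue
        sign, name, op, version = match.groups()
        target = removed if sign == "-" else added
        target[name.lower()] = op + version
    return {
        name: (removed[name], added[name])
        for name in removed
        if name in added and removed[name] != added[name]
    }


example_diff = """--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,3 @@
 flask==1.0.2
-django>=2.0.7
+django>=2.0.8
 requests==2.19.1
"""

print(version_bumps(example_diff))
```

A real dependency parser has to handle each ecosystem's manifest and lockfile formats; the regular expression here only covers the simplest case.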
On an average day, the dependency graph can track around 10,000 commits to dependency files for any of our supported languages. We can’t manually process this many commits. Instead, we depend on machine intelligence to sift through them and extract those that might be related to a security release.
For this purpose, we created a machine learning model that scans text associated with public commits (the commit message and linked issues or pull requests) to pick out those that appear to be related to security upgrades. With this smaller batch of commits, the model uses the diff to understand how required version ranges have changed. Then it aggregates across a specific timeframe to get a holistic view of all dependencies that a security release might affect. Finally, the model outputs a list of packages and version ranges that it thinks require an alert and that aren’t currently covered by any known CVE in our system.
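The stages described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the model itself: a keyword match stands in for the trained text classifier, and the `Commit` record, the `SECURITY_HINTS` list, the `known_advisories` set, and the `min_repos` threshold are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical keyword list standing in for the trained text model.
SECURITY_HINTS = ("security", "vulnerability", "cve", "xss", "injection")


@dataclass
class Commit:
    repo: str
    message: str      # commit message plus any linked issue or PR text
    package: str      # dependency whose requirement changed in the diff
    new_version: str  # version the requirement was bumped to


def looks_security_related(commit):
    """Stage 1: keep commits whose text hints at a security upgrade."""
    text = commit.message.lower()
    return any(hint in text for hint in SECURITY_HINTS)


def candidate_alerts(commits, known_advisories, min_repos=2):
    """Stages 2-3: group the filtered commits by (package, version), keep
    bumps seen in at least `min_repos` distinct repositories, and drop
    those already covered by a known advisory."""
    repos_by_bump = defaultdict(set)
    for commit in commits:
        if looks_security_related(commit):
            repos_by_bump[(commit.package, commit.new_version)].add(commit.repo)
    return sorted(
        bump
        for bump, repos in repos_by_bump.items()
        if len(repos) >= min_repos and bump not in known_advisories
    )


commits = [
    Commit("a/app", "Bump lodash for security fix", "lodash", "4.17.11"),
    Commit("b/site", "Upgrade lodash (vulnerability)", "lodash", "4.17.11"),
    Commit("c/tool", "Update lodash", "lodash", "4.17.11"),
    Commit("d/api", "security: bump rails", "rails", "5.2.1"),
    Commit("e/lib", "Security update for nokogiri", "nokogiri", "1.8.5"),
    Commit("f/web", "CVE fix: nokogiri", "nokogiri", "1.8.5"),
]
known = {("nokogiri", "1.8.5")}

print(candidate_alerts(commits, known))
```

Requiring the same bump in several independent repositories is one simple way to aggregate over a time window; the sample commits are invented for the example.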
Always quality focused
No machine learning model is perfect. While machine intelligence can sift through thousands of commits in an instant, this anomaly-detection algorithm will still generate false positives for packages where no security patch was released. Security alert quality is a focus for us, so we review all model output before the community receives an alert.
Learn more
Interested in learning more? Join us at GitHub Universe next week to explore the connections that push technology forward and keep projects secure through talks, trainings, and workshops. Tune in to the blog October 16-17 for more updates and announcements.