Applying machine intelligence to GitHub security alerts

Learn how we use machine learning to power and build on security alerts and make GitHub more secure.

October 9, 2018 | Updated February 16, 2022

Last year, we released security alerts that track security vulnerabilities in Ruby and JavaScript packages. Since then, we’ve identified more than four million of these vulnerabilities and added support for Python. In our launch post, we mentioned that all vulnerabilities with CVE IDs are included in security alerts, but some vulnerabilities are never disclosed in the National Vulnerability Database. Fortunately, we can supplement our collection of security alerts with vulnerabilities detected from activity within our developer community.

Leveraging the community

There are many places a project can publicize security fixes within a new version: the CVE feed, various mailing lists and open source groups, or even its own release notes or changelog. Regardless of how projects share this information, some developers within the GitHub community will see the advisory and immediately bump their required versions of the dependency to a known safe version. When we detect these commits, we can use the information in them to generate security alerts for vulnerabilities that may not have been published in the CVE feed.
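To make the idea of "bumping a required version" concrete, here is a minimal sketch of how a version bump could be read out of a commit's diff. It is an illustration only, not the code GitHub runs: the function name `version_bumps`, the `PIN` pattern, and the assumption that the dependency file uses simple `name==version` pins are all hypothetical.

```python
import re

# Hypothetical, simplified pin format: "name==version" on its own line.
PIN = re.compile(r"^([A-Za-z0-9_.\-]+)==([A-Za-z0-9_.\-]+)\s*$")


def version_bumps(diff_text):
    """Return {package: (old_version, new_version)} for pins changed in a unified diff."""
    removed, added = {}, {}
    for line in diff_text.splitlines():
        # Skip the file headers ("--- a/..." and "+++ b/...").
        if line.startswith(("---", "+++")):
            continue
        if line.startswith("-"):
            target = removed
        elif line.startswith("+"):
            target = added
        else:
            continue  # context lines and hunk headers carry no change
        match = PIN.match(line[1:].strip())
        if match:
            target[match.group(1).lower()] = match.group(2)
    # A bump is a package that appears on both sides with different versions.
    return {
        name: (removed[name], added[name])
        for name in removed
        if name in added and removed[name] != added[name]
    }
```

A real dependency file can express ranges, extras, and lockfile entries, so a production parser would need to be language-aware. The sketch only shows the shape of the signal: the same package, removed at one version and added at another.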

On an average day, the dependency graph tracks around 10,000 commits to dependency files across our supported languages. We can’t manually process this many commits. Instead, we depend on machine intelligence to sift through them and extract those that might be related to a security release.

For this purpose, we created a machine learning model that scans text associated with public commits (the commit message and linked issues or pull requests) to pick out those related to possible security upgrades. For this smaller batch of commits, the model uses each diff to determine how required version ranges have changed. Then it aggregates across a specific timeframe to get a holistic view of all dependencies that a security release might affect. Finally, the model outputs a list of packages and version ranges it thinks require an alert and that aren’t currently covered by any known CVE in our system.
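The stages described above (filter by text, read the version change, aggregate, exclude what is already known) can be sketched in a few lines. This is a toy under stated assumptions, not the actual model: a keyword check stands in for the trained text classifier, the input is assumed to be already limited to one time window, and every name here (`Commit`, `SECURITY_HINTS`, `candidate_alerts`, `min_repos`) is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical keyword list; a stand-in for the trained text model.
SECURITY_HINTS = ("security", "vulnerab", "cve", "xss", "injection")


@dataclass(frozen=True)
class Commit:
    repo: str
    message: str      # commit message plus any linked issue or pull request text
    package: str
    old_version: str
    new_version: str


def looks_security_related(text):
    """Crude keyword check standing in for a learned classifier."""
    lowered = text.lower()
    return any(hint in lowered for hint in SECURITY_HINTS)


def candidate_alerts(commits, known_advisories, min_repos=2):
    """Return sorted (package, new_version) pairs that several repositories
    moved to in security-sounding commits and that no known advisory covers.

    `commits` is assumed to hold only commits from one time window.
    """
    repos_by_upgrade = defaultdict(set)
    for commit in commits:
        if looks_security_related(commit.message):
            repos_by_upgrade[(commit.package, commit.new_version)].add(commit.repo)
    return sorted(
        upgrade
        for upgrade, repos in repos_by_upgrade.items()
        if len(repos) >= min_repos and upgrade not in known_advisories
    )
```

Requiring agreement from more than one repository is one simple way to aggregate; a single commit mentioning "security" is weak evidence, while several independent projects moving to the same version in the same window is a stronger hint that a security release happened.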

Always quality focused

No machine learning model is perfect. While machine intelligence can sift through thousands of commits in an instant, this anomaly-detection algorithm will still generate false positives for packages where no security patch was released. Security alert quality is a focus for us, so we review all model output before the community receives an alert.

Learn more

Interested in learning more? Join us at GitHub Universe next week to explore the connections that push technology forward and keep projects secure through talks, trainings, and workshops. Tune in to the blog October 16-17 for more updates and announcements.

Written by

Ben Thompson
