CodeGen: Semantic’s improved language support system
The Semantic Code team shipped a massive improvement to the language support system that powers code navigation. Code navigation features only scratch the surface of possibilities that start to open up when we combine Semantic’s program analysis potential with GitHub’s scale.
GitHub is home to over 50 million developers worldwide. Our team’s mission is to analyze code on our platform and surface insights that empower users to feel more productive. Ideally, this analysis should work for all users, regardless of which programming language they use. Until now, however, we’ve only been able to support a handful of languages due to the high cost of adding and maintaining them. Our new language support system, CodeGen, cuts that cost dramatically by automating a significant portion of the pipeline, in addition to making it more resilient. The result is that it is now easier than ever to add new programming languages that get parsed by our library.
Language Support is mission-critical
Before Semantic can compute diffs or perform abstract interpretation, we need to transform code into a representation that is amenable to our analysis goals. For this reason, a significant portion of our pipeline deals with processing source code from files on GitHub.com into an appropriate representation—we call this our “language support system”.
Adding languages has been difficult
Zooming in on the part of Semantic that handled language support, we see that it involved several development phases, including two parsing steps that required writing and maintaining two separate grammars per language.
Reading the diagram from left to right, our historic language support pipeline:
- Parsed source code into ASTs. A grammar is hand-written for a given language using `tree-sitter`, an incremental GLR parsing library for programming tools.
- Read `tree-sitter` ASTs into Semantic. Connecting Semantic to `tree-sitter`’s C library requires providing an interface to the C source. We achieve this through our `haskell-tree-sitter` library, which has Haskell bindings to `tree-sitter`.
- Parsed ASTs into a generalized representation of syntax. For these ASTs to be consumable by our Haskell project, we had to translate the `tree-sitter` parse trees into an appropriate representation. This required:
  - À la carte syntax types: generalization across programming languages. Many constructs, such as `If` statements, occur in several languages. Instead of having different representations of `If` statements for each language, could we reduce duplication by creating a generalized representation of syntax that could be shared across languages, such as a datatype modeling the semantics of conditional logic? This was the reasoning behind creating our hand-written, generalized à la carte syntaxes based on Wouter Swierstra’s Data types à la carte approach, allowing us to represent those shared semantics across languages. For example, this file captures a subset of à la carte datatypes that model expression syntaxes across languages.
  - Assignment: turning `tree-sitter`’s representation into a Haskell datatype representation. We had to translate `tree-sitter` AST nodes into the new generalized à la carte syntax. To do this, a second grammar was written in Haskell to assign the nodes of the `tree-sitter` ASTs onto a generalized representation of syntax modeled by the à la carte datatypes. As an example, here is the `Assignment` grammar written for Ruby.
- Performed Evaluation. Next, we captured what it meant to interpret the syntax datatypes. To do so, we wrote a polymorphic type class called `Evaluatable`, which defined the necessary interface for a term to be evaluated. `Evaluatable` instances were added for each of the à la carte syntaxes. (A minimal sketch of the à la carte and `Evaluatable` ideas appears after this list.)
- Handled Effects. In addition to evaluation semantics, describing the control flow of a given program also requires modeling effects. This helps ensure we can represent things like the file system, state, non-determinism, and other effectful computations.
- Validated via tests. Tests for diffing, tagging, graphing, and evaluating source code written in that language were added along the way.
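To make the à la carte and `Evaluatable` ideas concrete, here is a minimal, self-contained Haskell sketch. The names and signatures are invented for illustration and are not Semantic’s actual definitions: one generalized datatype models conditional logic for every language, and a small type class captures what it means to evaluate it.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Illustrative sketch only: a toy version of the à la carte idea, not
-- Semantic's real types. One generalized syntax datatype is shared by every
-- language, and a type class captures what it means to evaluate it.

-- A generalized conditional; the parameter 'a' stands for the subterms, so
-- the same shape can model `if` in Python, Ruby, JavaScript, and so on.
data If a = If { condition :: a, consequence :: a, alternative :: a }
  deriving (Functor, Show)

-- A toy stand-in for the 'Evaluatable' interface: given a way to evaluate
-- subterms, a syntax type describes how to evaluate itself.
class Evaluatable syntax where
  eval :: (term -> Int) -> syntax term -> Int

instance Evaluatable If where
  eval go (If c t e) = if go c /= 0 then go t else go e

main :: IO ()
main = print (eval id (If 1 42 7))  -- condition is non-zero, so prints 42
```

The real `Evaluatable` interface was considerably richer, since it also had to account for effects and control flow, but the shape of the abstraction is the same: shared syntax types plus per-syntax evaluation rules.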
Challenges posed by the system
The process described above had several obstacles. Not only was it technically involved, it also imposed additional limitations.
- The system was brittle. Each language’s `Assignment` code was tightly coupled to the language’s `tree-sitter` grammar. This meant it could break at runtime if we changed the structure of the grammar, without any compile-time error. Preventing such errors required tracking ongoing changes in `tree-sitter`, which was tedious, manual, and error-prone. Each time a grammar changed, assignment changes had to be made to accommodate new tree structures, such as nodes that changed names or shifted positions. Because improvements to the underlying grammars required changes to `Assignment` that were costly in time and risked introducing bugs, our system had inadvertently become incentivized against iterative improvement.
- There were no named child nodes. `tree-sitter`’s syntax nodes didn’t provide us with named child nodes. Instead, child nodes were structured as ordered lists, without any name indicating the role of each child. This didn’t match Semantic’s internal representation of syntax nodes, where each type of node has a specific set of named children, so more `Assignment` work was necessary to compensate for the discrepancy. One concern, for example, was how we represented comments, which could be any arbitrary node attached to any part of the AST. Named child nodes would let us associate comments with their parent nodes (for instance, a comment appearing in an `if` statement could be the first child of that `if` statement node). The same applies to any other node that can appear anywhere within the tree, such as Ruby heredocs.
- Evaluation and à la carte sum types were sub-optimal. Taking a step back to examine language support also gave us an opportunity to rethink our à la carte datatypes and the evaluation machinery. À la carte syntax types were motivated by a desire to better share tooling when evaluating common fragments of languages. However, the introduction of these syntax types (and the design of the `Evaluatable` typeclass) did not allow our analysis to be sensitive to minor linguistic differences, or even to relate different pieces of syntax together. We could overcome this by adding language-specific syntax datatypes to be used with `Assignment`, along with their accompanying `Evaluatable` instances, but this would defeat the purpose of a generalized representation. This is because à la carte syntax was essentially untyped; it enforced only a minimal structure on the tree. As a result, any given subterm could be any element of the syntax, rather than some limited subset. This meant that many `Evaluatable` instances had to deal with error conditions that in practice can’t occur. To make this idea more concrete, consider examples showcasing a before-and-after syntax type transformation (a small sketch contrasting the two appears after this list):

  ```haskell
  -- former system: à la carte syntax
  data Op a = Op { ann :: a, left :: Expression a, right :: Expression a }

  -- new system: auto-generated, precisely typed syntax
  data Op a = Op { ann :: a, left :: Err (Expression a), right :: Err (Expression a) }
  ```

  The shape of a syntax type in our à la carte paradigm has polymorphic children, compared with the monomorphic children of our new “precisely-typed” syntax, which offers better guarantees about what we can expect.
- Infeasible time and effort was required. The two-step parsing process required writing two separate language-specific grammars by hand. This was time-consuming, engineering-intensive, error-prone, and tedious. The `Assignment` grammar used parser combinators in Haskell mirroring the `tree-sitter` grammar specification, which felt like a lot of duplicated work. For a long time, this work’s mechanical nature begged the question of whether we could automate parts of it. While we’ve open-sourced Semantic, leveraging community support for adding languages has been difficult because, until recently, it was backed by such a grueling process.
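The following self-contained sketch, using invented toy types rather than Semantic’s, shows why the untyped children mattered in practice: with polymorphic children, an evaluator has to defend against shapes that can never legally occur, whereas precisely-typed children make those shapes unrepresentable.

```haskell
-- Illustrative sketch only; these types are invented for this example and are
-- not Semantic's actual definitions.

-- À la carte style: children are just 'a', i.e. potentially any syntax.
data OpLoose a = OpLoose { leftL :: a, rightL :: a }

-- Any node in the whole (toy) language can appear as a child...
data AnySyntax = Num Int | Str String | Comment String

-- ...so evaluation has to handle combinations that can't really happen,
-- such as adding a comment to a number.
evalLoose :: OpLoose AnySyntax -> Either String Int
evalLoose (OpLoose l r) = (+) <$> asNum l <*> asNum r
  where
    asNum (Num n) = Right n
    asNum _       = Left "impossible in practice, but the types allow it"

-- Precisely-typed style: the children are constrained to expressions, so the
-- "comment as an operand" case simply cannot be constructed.
newtype Expression = Expression Int

data OpPrecise = OpPrecise { leftP :: Expression, rightP :: Expression }

evalPrecise :: OpPrecise -> Int
evalPrecise (OpPrecise (Expression l) (Expression r)) = l + r

main :: IO ()
main = do
  print (evalLoose (OpLoose (Num 1) (Comment "oops")))            -- Left ...
  print (evalPrecise (OpPrecise (Expression 1) (Expression 2)))   -- 3
```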
Designing a new system
To address these challenges, we introduced a few changes:
- Add named child nodes. To address the issue of not having named child nodes, we modified the `tree-sitter` library by adding a new function called `field` to the grammar API and then updating every language grammar accordingly. When parsing, you can now retrieve a node’s children based on their field names. Here is an example of what a Python `if_statement` looks like in the old and new `tree-sitter` grammar APIs.
- Generate a Node Interface File. Once a grammar has this way of associating child references, the parser generation code also produces a `node-types.json` file that indicates what kinds of children you can expect for each node type. This JSON file provides static information about nodes’ fields based on the grammar. Using this JSON, applications like ours can use metaprogramming to generate specific datatypes for each kind of node. Here is an example of the JSON generated from the grammar definition of an `if` statement. This file provides a schema for a language’s ASTs and has enabled additional improvements, such as the way we specify highlighting.
- Auto-generate language-specific syntax datatypes. Using the structure provided by the `node-types.json` file, we can auto-generate syntax types instead of writing them by hand. First, we deserialize the JSON file to capture the structure we want to represent in the desired shape of datatypes. Specifically, the nodes in our node-types JSON file take on four distinct shapes: sums, products, named leaves, and anonymous leaves. We then use Template Haskell to generate syntax datatypes for each of the language constructs represented by the Node Interface File. This means that our hand-written à la carte syntax types get replaced with auto-generated language-specific types, saving all of the developer time historically spent writing them. Here is an example of an auto-generated datatype representing a Python `if` statement derived from the JSON snippet provided above, which is structurally a product type. (A hedged sketch of what such generated types can look like appears after this list.)
- Build ASTs generically. Once we have an exhaustive set of language-specific datatypes, we need a mechanism that can map the appropriate auto-generated datatypes onto the ASTs representing the source code being parsed. Historically, this was accomplished by manually writing an `Assignment` grammar. To obviate the need for a second grammar, we created an API that uses Haskell’s generic metaprogramming facilities to unmarshal `tree-sitter`’s parse trees automatically. We iterate over `tree-sitter`’s parse trees using its tree cursor API and produce Haskell ASTs, where each node is represented by the Template Haskell-generated datatype described in the previous step. This allows us to parse a particular set of nodes according to their structure and return an AST with metadata (such as range and span information). Here is an example of the AST generated if the Python source code is simply `1`.
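As a rough, hedged illustration of what the generated code can look like (the field and type names below are approximations, not the exact output of Semantic’s Template Haskell machinery): a product node such as Python’s `if_statement` becomes a record whose fields mirror the named children recorded in `node-types.json`, while a named leaf becomes a type carrying only its annotation and source text.

```haskell
module GeneratedTypesSketch where

-- Approximate illustration only; the real datatypes are generated by Template
-- Haskell from node-types.json and may differ in names and detail.

-- A named leaf: a node with no children, only its annotation and source text.
data Identifier a = Identifier { idAnn :: a, idText :: String }

-- Stand-ins for other generated node types referenced below.
data Expression a = Expression { exprAnn :: a }
data Block      a = Block      { blockAnn :: a }
data ElseClause a = ElseClause { elseAnn :: a, elseBody :: Block a }

-- A product node: `if_statement` has named fields, so the generated record
-- mirrors the `fields` entry for if_statement in node-types.json.
data IfStatement a = IfStatement
  { ifAnn       :: a                    -- range/span metadata for this node
  , condition   :: Expression a         -- the `condition` field
  , consequence :: Block a              -- the `consequence` field
  , alternative :: Maybe (ElseClause a) -- the optional `alternative` field
  }
```

Sum-shaped nodes (positions that admit several node types) become choice types over exactly those alternatives, and anonymous leaves (such as punctuation) become types carrying only their annotation.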
The final result is a set of language-specific, strongly-typed, TH-generated datatypes represented as the sum of syntax possible at a given point in the grammar. Strongly-typed trees give us the ability to indicate only the subset of the syntax that can occur at a given position. For example, a function’s name would be strongly typed as an `identifier`; a `switch` statement would contain `case` statements; and so on. This provides better guarantees about where syntax can occur, and strong compile-time guarantees about both correctness and completeness.
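A tiny sketch, again with invented types rather than Semantic’s generated ones, of what those compile-time guarantees buy: when a field is typed as exactly the syntax allowed at that position, placing anything else there is rejected by the compiler instead of failing later during assignment.

```haskell
module PreciseTypesSketch where

-- Invented toy types for illustration; not Semantic's generated output.

newtype Identifier a = Identifier String

-- Only case clauses can occur inside a switch body.
newtype CaseClause a = CaseClause [Statement a]

newtype SwitchStatement a = SwitchStatement [CaseClause a]

-- The statements allowed at this position form a closed sum.
data Statement a
  = Break
  | ExpressionStatement (Identifier a)

-- A function's name can only ever be an identifier.
data FunctionDefinition a = FunctionDefinition
  { name :: Identifier a
  , body :: [Statement a]
  }

ok :: FunctionDefinition ()
ok = FunctionDefinition (Identifier "render") [Break]

-- A definition whose "name" is, say, a switch statement does not typecheck,
-- so the mistake surfaces at compile time rather than at runtime:
-- bad = FunctionDefinition (SwitchStatement []) []
```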
The new system bypasses a significant part of the engineering effort historically required; it cuts code from our pipeline in addition to addressing the technical limitations described above. The diagram below provides a visual “diff” of the old and new systems.
A big testament to our approach’s success was that we were able to remove our à la carte syntaxes completely. In addition, we shipped two new languages, Java and CodeQL, using precise ASTs generated by the new system.
Contributions welcome!
To learn more about how you can help, check out our documentation here.