Highlights from Git 2.28

The open source Git project just released Git 2.28 with features and bug fixes from over 58 contributors, 13 of them new. We last caught up with you on the latest in Git back when 2.26 was released. Here’s a look at some of the most interesting features and changes introduced since then.

Introducing init.defaultBranch

When you initialize a new Git repository from scratch with git init, Git has always created an initial branch named master. In Git 2.28, a new configuration option, init.defaultBranch, is being introduced to replace the hard-coded term. (For more background on this change, this statement from the Software Freedom Conservancy is an excellent place to look.)

Starting in Git 2.28, git init will instead look to the value of init.defaultBranch when creating the first branch in a new repository. If that value is unset, init.defaultBranch defaults to master. Here, it’s important to note that:

  1. This configuration variable can be set by the user, and overriding the default value is as easy as:
    $ git config --global init.defaultBranch main
    
  2. This configuration variable only affects new repositories, and does not cause branches in existing projects to be renamed. git clone will also continue to respect the HEAD of the repository you’re cloning from, so you won’t see a change in branch names until a maintainer initiates one.

This change supports the many communities, both on GitHub and in the wider Git community, who are considering renaming the default branch of their repositories from master.

To learn more about the complementary changes GitHub is making, see github/renaming. GitLab and Bitbucket are also making similar changes.

[source]

Changed-path Bloom filters

In Git 2.27, the commit-graph file format was extended to store changed-path Bloom filters. What does all of that mean? In a sense, this new information helps Git find points in history that touched a given path much more quickly (for example, git log -- <path>, or git blame). Git 2.28 takes advantage of these optimizations to deliver a handful of sizeable performance improvements.

Before we get into all of that, it’s worth a quick refresher on commit graphs, whether you’re new to the concept or already familiar with it. (If you are familiar, and want to take a deeper dive, check out this blog post explaining all of the juicy technical details.)

In the very simplest terms, the commit-graph file stores information about commits. In essence, the commit-graph acts like a cache for commonly-accessed information about commits: who their parent(s) are, what their root tree is, and things like that. It also stores computed information, like a commit’s generation number and changed-path Bloom filters (more on that in just a moment).

Why store all of this information? To understand the answer to this, it is helpful to have a cursory understanding of how Git stores objects. Git stores objects in one of two ways: either as a loose object (in which case the object’s contents are stored in a single file unique to that object on disk), or as a packed object (in which case the object is assembled from a compressed format in a *.pack file). No matter which way a commit is stored, we still have to parse and decompress it before its fields like “root tree” and “parents” can be accessed.

With a commit-graph file, all of that information is immediate: for a given commit C, Git knows exactly where to look in a commit-graph file for all of those fields that we store, and can read them off immediately, no decompression or piecing together required. This can shave some time off your usual Git operations by itself, but where the commit-graph really shines is in the computed data it stores.
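To make the contrast concrete, here is a minimal sketch of the two access patterns. The loose-object half follows the way Git stores a commit (a zlib-compressed "commit <size>" header plus text); the fixed-width record is a simplified, hypothetical layout chosen for illustration and is not the real commit-graph format. The IDs, author line, and the names parse_loose_commit and read_record are all made up for this example.

```python
import struct
import zlib

# A loose commit object: a "commit <size>\0" header followed by the commit
# text, all zlib-compressed. The IDs below are dummy values.
TREE = "a" * 40
PARENT = "b" * 40
body = (
    f"tree {TREE}\n"
    f"parent {PARENT}\n"
    "author A U Thor <author@example.com> 1595808000 +0000\n"
    "committer A U Thor <author@example.com> 1595808000 +0000\n"
    "\n"
    "Example commit\n"
).encode()
loose_object = zlib.compress(b"commit %d\x00" % len(body) + body)


def parse_loose_commit(data):
    """Inflate the object, then scan its header lines for tree and parents."""
    raw = zlib.decompress(data)
    header, _, content = raw.partition(b"\x00")
    assert header.startswith(b"commit ")
    tree, parents = None, []
    for line in content.decode().split("\n"):
        if line == "":  # a blank line ends the header section
            break
        key, _, value = line.partition(" ")
        if key == "tree":
            tree = value
        elif key == "parent":
            parents.append(value)
    return tree, parents


# Hypothetical fixed-width record (NOT the real on-disk layout): a 20-byte
# tree ID, the first parent's position in the table, and a generation number.
RECORD = struct.Struct(">20sII")
table = RECORD.pack(bytes.fromhex(TREE), 7, 42) * 3  # three identical rows


def read_record(table, position):
    """Jump straight to row `position`; no inflating or text parsing."""
    tree, parent_pos, generation = RECORD.unpack_from(table, position * RECORD.size)
    return tree.hex(), parent_pos, generation


print(parse_loose_commit(loose_object))
print(read_record(table, 2))
```

The first function has to inflate and scan the whole object before it can answer anything. The second computes an offset and reads the fields directly, which is the property the commit-graph relies on.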

Generation numbers are a sort of reachability index that can help Git answer questions about things like reachability and topological ordering very quickly. Since generation numbers aren’t new in this release (and trying to explain them quickly would lose a lot of the benefit of a more careful exposition), I’ll refer you instead to this blog post by freshly-minted Hubber Derrick Stolee on the matter.

What’s new in 2.28?

OK, if you’ve made it this far, you’ve got a pretty good handle on what commit graphs are, and what they’re useful for. Now, let’s get to the juicy details. In Git 2.27, the commit-graph file learned how to store changed-path Bloom filters. What are changed-path Bloom filters, you ask? A Bloom filter is a probabilistic set; that is, it’s a set of items, but querying that set for the presence of some item x returns either “x is definitely not in this set” or “x might be in this set”, but never “x is definitely in this set”. The commit-graph stores one of these Bloom filters for commits that reside in the commit-graph, and it populates each Bloom filter with the paths changed between that commit and its first parent.

These Bloom filters are a huge boon for performance in lots of Git commands. The general pattern is something like: if you have a Git command that computes diffs (which can sometimes be proportionally expensive), then having Bloom filters allows Git to compute far fewer diffs by skipping the computation for certain commits when their Bloom filters return “definitely not” for paths of interest.

Take git log -- /path/to/file, for example. Prior to Git 2.27, git log would have to compute a diff over every revision in its walk before determining whether or not to show it (i.e., whether or not that diff has any entries for /path/to/file). In Git 2.27 and newer, Git can skip computing many of those diffs altogether by consulting each commit C’s changed-path Bloom filter and querying it for /path/to/file. Again: if querying returns “definitely not”, then Git knows that computing that diff is strictly uninteresting.

Because computing diffs between commits can be expensive (at least, relative to the complexity of the algorithm for which they are being generated), reducing the number of diffs computed overall can greatly improve performance.
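The skip-the-diff pattern described above can be sketched in a few lines. This is a toy model rather than Git’s implementation: the filter size, the SHA-256-based hashing, the four-commit history, and the names BloomFilter and log_for_path are all assumptions made for illustration, and Git’s real filters use different hashing and sizing.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: answers "definitely not" or "maybe", never "yes"."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical history, newest first: (commit, paths changed vs. first parent).
history = [
    ("c4", {"README.md"}),
    ("c3", {"src/main.c", "src/util.c"}),
    ("c2", {"docs/guide.txt"}),
    ("c1", {"src/main.c"}),
]

# What the commit-graph would precompute: one filter per commit.
filters = {}
for commit, changed in history:
    filters[commit] = BloomFilter()
    for path in changed:
        filters[commit].add(path)


def log_for_path(path):
    """Return (commits touching path, number of diffs actually computed)."""
    shown, diffs = [], 0
    for commit, changed in history:
        if not filters[commit].might_contain(path):
            continue  # "definitely not": skip the diff entirely
        diffs += 1  # "maybe": fall back to the real diff
        if path in changed:  # stands in for computing the diff
            shown.append(commit)
    return shown, diffs


shown, diffs = log_for_path("src/main.c")
print(shown, f"{diffs} of {len(history)} diffs computed")
```

Because a Bloom filter never produces a false negative, the commits shown are always the same as they would be without filters. A false positive only costs one extra diff, so the filter affects speed and never correctness.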

To try this for yourself, you can run the command:

$ git commit-graph write --reachable --changed-paths

This generates a commit-graph file with changed-path Bloom filters enabled.[1] You should be able to see performance improvements in commands like git log -- <path>, git log -L, git blame, and anything else that computes first-parent diffs against a given pathspec.

[source, source, source]

Tidbits

Now that we’ve talked about a few of the headlining changes from the past couple of releases, let’s look at a few more new features 🔎

  • Have you ever been looking for the parts of history that changed some path? Maybe you just want to know about the commits that have modified some file, and that can be found easily enough by running git log -- <path>.

    Sometimes, you might be interested not only in which commits touched <path>, but which merge commits brought those commits into the main line of development. Have you ever found those merges difficult to find? You’re not alone. In most cases, Git will skip showing you those kinds of merges with git log -- <path>, since those commits don’t modify the <path> by themselves.

    Now you can bring those merges back into view with Git’s new --show-pulls flag to revision walking commands, like git log and git rev-list. For a particularly informative view, try:
    $ git log --oneline --graph --show-pulls -- <path>
    

    [source]

  • When you run git pull in a repository where you’re tracking a remote branch, one of four things can happen: there might be no changes at all, changes only on the server, changes only on the client, or changes on both. As long as there aren’t changes in both directions, resolving the difference is straightforward: when there are no changes at all, there’s nothing to do. When the server is strictly ahead of the client, the client fast-forwards to the state on the server. When only the client has changes, there is nothing to pull.

    But when there are changes both on the client and on the server, what happens? That depends on your pull.rebase configuration. If it is set to true, your branch is rebased on top of where you’re pulling from; otherwise, a merge is performed.

    These merges can clutter your history and be tricky to back out of without starting over your pull from scratch. Git 2.28 now warns you of this case (specifically, when pull.rebase is unset, and you didn’t explicitly specify --[no-]rebase as an argument to git pull).

    [source]

  • Git now includes a GitHub Actions workflow which you can use to run Git’s own integration tests on a variety of platforms and compilers. There’s no extra effort required on your part: if you have a fork of git/git on GitHub, each push will be run through the array of tests necessary to validate your change. But wait: doesn’t Git use a mailing list for development? Yes, it does, but now you can use GitGitGadget on the git/git repository. This means that you can open a pull request, and have GitGitGadget send it to the mailing list on your behalf. So, if you’re more comfortable contributing to Git like that instead of composing emails manually, you can now contribute to Git from start to finish using GitHub.

    [source]

  • On the other hand, if you don’t mind sending an email or two, it’s now much easier to interact with the Git mailing list when you encounter a bug by running git bugreport. Running this new command will open your $EDITOR with a pre-populated form of questions that will be useful in debugging your issue. It also includes some helpful information about your system, like your CPU architecture, what version of Git you’re running, and so on.

    When you’re done, you can send that file as the body of an email to the Git mailing list, and rest assured that you’ve opened a helpful bug report.

    [source]

  • We’ve talked a number of times about Git’s clean and smudge filters and the corresponding process filter (which simulates multiple clean and smudge filters in a single process). Up until recently, the protocol for these filters has been relatively straightforward: Git supplies one end of the content, and the filter produces the other.

    In Git 2.27, more information is supplied over the protocol, like metadata about the branch being checked out in the case of git checkout, or the remote that was contacted in the case of a git fetch. This new information could be used in tools like Git LFS, for example, to figure out which remote to contact for extra data.

    [source]

  • Last but not least, git status learned some new tricks, too. You might recall from a recent blog post that we talked about how sparse checkouts can shrink the size of your monorepo. Now, git status can remind you when you are in a sparse checkout by telling you what percentage of files you have checked out.

    For fans of git-prompt.sh, the prompt will now display SPARSE if you are in a sparse checkout, too.

    [source]

The rest of the iceberg

That’s just a sample of changes from the latest couple of releases. For more, check out the release notes for 2.27 and 2.28, or any previous version in the Git repository.

[1]: Note that since Bloom filters are not persisted automatically (that is, you have to pass --changed-paths explicitly on each subsequent write), it is a good idea to disable configuration that automatically generates commit-graphs, like fetch.writeCommitGraph and gc.writeCommitGraph.
