Git clone: a data-driven study on cloning behaviors

@derrickstolee recently discussed several different git clone options, but how do those options actually affect your Git performance? Which option is fastest for your client experience? Which option is fastest for your build machines?…

|
| 15 minutes

@derrickstolee recently discussed several different git clone options, but how do those options actually affect your Git performance? Which option is fastest for your client experience? Which option is fastest for your build machines? How can these options impact server performance? If you are a GitHub Enterprise Server administrator it’s important that you understand how the server responds to these options under the load of multiple simultaneous requests.

Here at GitHub, we use a data-driven approach to answer these questions. We ran an experiment to compare these different clone options and measured the client and server behavior. It is not enough to just compare git clone times, because that is only the start of your interaction with a Git repository. In particular, we wanted to determine how these clone options change the behavior of future Git operations such as git fetch.

In this experiment, we aimed to answer the below questions:

  1. How fast are the various git clone commands?
  2. Once we have cloned a repository, what kind of impact do future git fetch commands have on the server and client?
  3. What impact do full, shallow and partial clones have on a Git server? This is mostly important for our GitHub Enterprise Server Admins.
  4. Will the repository shape and size make any difference in the overall performance?

It is worth special emphasis that these results come from simulations that we performed in our controlled environments and do not simulate complex workflows that might be used by many Git users. Depending on your workflows and repository characteristics these results may change. Perhaps this experiment provides a framework that you could follow to measure how your workflows are affected by these options. If you would like help analyzing your worksflows, feel free to engage with GitHub’s Professional Services team.

For a summary of our findings, feel free to jump to our conclusions and recommendations.

Experiment design

To maximize the repeatability of our experiment, we use open source repositories for our sample data. This way, you can compare your repository shape to the tested repositories to see which is most applicable to your scenario.

We chose to use the jquery/jqueryapple/swift and torvalds/linux repositories. These three repositories vary in size and number of commits, blobs, and trees.

These repositories were mirrored to a GitHub Enterprise Server running version 2.22 on a 8-core cloud machine. We use an internal load testing tool based on Gatling to generate git requests against the test instance. We ran each test with a specific number of users across 5 different load generators for 30 minutes. All of our load generators use git version 2.28.0 which by default is using protocol version 1. We would like to make a note that protocol version 2 only improves ref advertisement and therefore we don’t expect it to make a difference in our tests.

Once a test is complete, we use a combination of Gatling results, ghe-governor and server health metrics to analyze the test.

Test repository characteristics

The git-sizer tool measures the size of Git repositories along many dimensions. In particular, we care about the total size on disk along with the count of each object type. The table below contains this information for our three test repositories.

Repo blobs trees commits Size (MB)
jquery/jquery 14,779 22,579 7,873 40
apple/swift 390,116 649,322 132,316 750
torvalds/linux 2,174,798 4,619,864 968,500 4,000

The jquery/jquery repository is a fairly small repository with only 40MB of disk space. apple/swift is a medium-sized repository with around 130 thousand commits and 750MB on disk. The torvalds/linux repository is typically the gold standard for Git performance tests on open source repositories. It uses 4 gigabytes of disk space and has close to a million commits.

Test scenarios

We care about the following clone options:

  1. Full clones.
  2. Shallow clones (--depth=1).
  3. Treeless clones (--filter=tree:0).
  4. Blobless clones (--filter=blob:none).

In addition to these options at clone time, we can also choose to fetch in a shallow way using --depth=1. Since treeless and blobless clones have their own way to reduce the objects downloaded during git fetch, we only test shallow fetches on full and shallow clones.

We organized our test scenarios into the following ten categories, labeled T1 through T10. T1 to T4, simulate four different git clone types. T5 to T10 simulate various git fetch operations into these cloned repositories.

Test# git command Description
T1 git clone Full clone
T2 git clone --depth 1 Shallow clone
T3 git clone --filter=tree:0 Treeless partial clone
T4 git clone --filter=blob:none Blobless partial clone
T5 T1 + git fetch Full fetch in a fully cloned repository
T6 T1 + git fetch --depth 1 Shallow fetch in a fully cloned repository
T7 T2 + git fetch Full fetch in a shallow cloned repository
T8 T2 + git fetch --depth 1 Shallow fetch in a shallow cloned repository
T9 T3 + git fetch Full fetch in a treeless partially cloned repository
T10 T4 + git fetch Full fetch in a blobless partially cloned repository

In partial clones, the new blobs at the new ref tip are not downloaded until we navigate to that position and populate our working directory with those blob contents. To be a fair comparison with the full and shallow clone cases, we also have our simulation run git reset --hard origin/$branch in all T5 to T10 tests. In T5 to T8 this extra step will not have a huge impact, but in T9 and T10 it will ensure the blob downloads are included in the cost.

In all the scenarios above, a single user was also set to repeatedly change 3 random files in the repository and push them to the same branch that the other users were cloning and fetching. This simulates repository growth so the git fetch commands actually have new data to download.

Test Results

Let’s dig into the numbers to see what our experiment says.

git clone performance

The full numbers are provided in the tables below.

Unsurprisingly, shallow clone is the fastest clone for the client, followed by a treeless then blobless partial clones, and finally full clones. This performance is directly proportional to the amount of data required to satisfy the clone request. Recall that full clones need all reachable objects, blobless clones need all reachable commits and trees, treeless clones need all reachable commits. A shallow clone is the only clone type that does not grow at all along with the history of your repository.

The performance impact of these clone types grows in proportion to the repository size, especially the number of commits. For example, a shallow clone of torvalds/linux is four times faster than a full clone, while a treeless clone is only twice as fast and a blobless clone is only 1.5 times as fast. It is worth noting that the development culture of the Linux project promotes very small blobs that compress extremely well. We expect that the performance difference to be greater for most other projects with a higher blob count or size.

As for server performance, we see that the Git CPU time per clone is higher for the blobless partial clone (T4). Looking a bit closer with ghe-governor, we observe that the higher Git CPU is mainly due to the higher amount of pack-objects operations in the partial clone scenarios (T3 and T4). In the torvalds/linux repository, the Git CPU time spent on pack-objects is four times more in a treeless partial clone (T3) compared to a full clone (T1). In contrast, in the smaller jquery/jquery repository, a full clone consumes more CPU per clone compared to the partial clones (T3 and T4). Shallow clone of all the three different repositories, consumes the lowest amount of total and Git CPU per clone.

If the full clone is sending more data in a full clone, then why is it spending more CPU on a partial clone? When Git sends all reachable objects to the client, it mostly transfers the data it has on disk without decompressing or converting the data. However, partial and shallow clones need to extract that subset of data and repackage it to send to the client. We are investigating ways to reduce this CPU cost in partial clones.

The real fun starts after cloning the repository and users start developing and pushing code back up to the server. In the next section we analyze scenarios T5 through T10, which focus on the git fetch and git reset --hard origin/$branch commands.

jquery/jquery clone performance

Test# description clone avgRT (milliseconds) git CPU spent per clone
T1 full clone 2,000ms (Slowest) 450ms (Highest)
T2 shallow clone 300ms (6x faster than T1) 15ms (30x less than T1)
T3 treeless partial clone 900ms (2.5x faster than T1) 270ms (1.7x less than T1)
T4 blobless partial clone 900ms (2.5x faster than T1) 300ms (1.5x less than T1)

apple/swift clone performance

Test# description clone avgRT (seconds) git CPU spent per clone
T1 full clone 50s (Slowest) 8s (2x less than T4)
T2 shallow clone 8s (6x faster than T1) 3s (6x less than T4)
T3 treeless partial clone 16s (3x faster than T1) 13s (Similar to T4)
T4 blobless partial clone 22s (2x faster than T1) 15s (Highest)

torvalds/linux clone performance

Test# description clone avgRT (minutes) git CPU spent per clone
T1 full clone 5m (Slowest) 60s (2x less than T4)
T2 shallow clone 1.2m (4x faster than T1) 40s (3.5x less than T4)
T3 treeless partial clone 2.4m (2x faster than T1) 120s (Similar to T4)
T4 blobless partial clone 3m (1.5x faster than T1) 130s (Highest)

git fetch performance

The full fetch performance numbers are provided in the tables below, but let’s first summarize our findings.

The biggest finding is that shallow fetches are the worst possible options, in particular from full clones. The technical reason is that the existence of a “shallow boundary” disables an important performance optimization on the server. This causes the server to walk commits and trees to find what’s reachable from the client’s perspective. This is more expensive in the full clone case because there are more commits and trees on the client that the server is checking to not duplicate. Also, as more shallow commits are accumulated, the client needs to send more data to the server to describe those shallow boundaries.

The two partial clone options have drastically different behavior in the git fetch and git reset --hard origin/$branch sequence. Fetching from blobless partial clones increases the reset command by a small, measurable way, but not enough to make a huge difference from the user perspective. In contrast, fetching from a treeless partial clone causes significant more time because the server needs to send all trees and blobs reachable from a commit’s root tree in order to satisfy the git reset --hard origin/$branch command.

Due to these extra costs as a repository grows, we strongly recommend against shallow fetches and fetching from treeless partial clones. The only recommended scenario for a treeless partial clone is for quickly cloning on a build machine that needs access to the commit history, but will delete the repository at the end of the build.

Blobless partial clones do increase the Git CPU costs on the server somewhat, but the network data transfer is much less than a full clone or a full fetch from a shallow clone. The extra CPU cost is likely to become less important if your repository has larger blobs than our test repositories. In addition, you have access to the full commit history, which might be valuable to real users interacting with these repositories.

It is also worth noting that we noticed a surprising result during our testing. During T9 and T10 tests for the Linux repository, our load generators encountered memory issues as it seems that these scenarios with the heavy load that we were running, triggered more auto Garbage Collections (GC). GC in the Linux repository is expensive and involves a full repack of all Git data. Since we were testing on a Linux client, the GC processes were launched in the background to avoid blocking our foreground commands. However, as we kept fetching we ended up with several concurrent background processes; this is not a realistic scenario but a factor of our synthetic load testing. We ran git config gc.auto false to prevent this from affecting our test results.

It is worth noting that blobless partial clones might trigger automatic garbage collection more often than a full clone. This is a natural byproduct of splitting the data into a larger number of small requests. We have work in progress to make Git’s repository maintenance be more flexible, especially for large repositories where a full repack is too time-consuming. Look forward to more updates about that feature here on the GitHub blog.

jquery/jquery fetch performance

Test# scenario git fetch avgRT git reset –hard git CPU spent per fetch
T5 full fetch in a fully cloned repository 200ms 5ms 4ms (Lowest)
T6 shallow fetch in a fully cloned repository 300ms 5ms 18ms (4x more than to T5)
T7 full fetch in a shallow cloned repository 200ms 5ms 4ms (Similar to T5)
T8 shallow fetch in a shallow cloned repository 250ms 5ms 14ms (3x more than to T5)
T9 full fetch in a treeless partially cloned repository 200ms 85ms 9ms (2x more than to T5)
T10 full fetch in a blobless partially cloned repository 200ms 40ms 6ms (1.5x more than to T5)

apple/swift fetch performance

Test# scenario git fetch avgRT git reset –hard git CPU spent per fetch
T5 full fetch in a fully cloned repository 350ms 80ms 20ms (Lowest)
T6 shallow fetch in a fully cloned repository 1,500ms 80ms 300ms (13x more than to T5)
T7 full fetch in a shallow cloned repository 300ms 80ms 20ms (Similar to T5)
T8 shallow fetch in a shallow cloned repository 350ms 80ms 45ms (2x more than to T5)
T9 full fetch in a treeless partially cloned repository 300ms 300ms 70ms (3x more than to T5)
T10 full fetch in a blobless partially cloned repository 300ms 150ms 35ms (1.5x more than to T5)

torvalds/linux fetch performance

Test# scenario git fetch avgRT git reset –hard git CPU spent per fetch
T5 full fetch in a fully cloned repository 350ms 250ms 40ms (Lowest)
T6 shallow fetch in a fully cloned repository 6,000ms 250ms 1,000ms (25x more than T5)
T7 full fetch in a shallow cloned repository 300ms 250ms 50ms (1.3x more than T5)
T8 shallow fetch in a shallow cloned repository 400ms 250ms 80ms (2x more than to T5)
T9 full fetch in a treeless partially cloned repository 350ms 1250ms 400ms (10x more than to T5)
T10 full fetch in a blobless partially cloned repository 300ms 500ms 140ms (3.5x more than to T5)

What does this mean for you?

Our experiment demonstrated some performance changes between these different clone and fetch options. Your mileage may vary! Our experimental load was synthetic, and your repository shape can differ greatly from these repositories.

Here are some common themes we identified that could help you choose the right scenario for your own usage:

If you are a developer focused on a single repository, the best approach is to do a full clone and then always perform a full fetch into that clone. You might deviate to a blobless partial clone if that repository is very large due to many large blobs, as that clone will help you get started more quickly. The trade-off is that some commands such as git checkout or git blame will require downloading new blob data when necessary.

In general, calculating a shallow fetch is computationally more expensive compared to a full fetch. Always use a full fetch instead of a shallow fetch both in fully and shallow cloned repositories.

In workflows such as CI builds when there is a need to do a single clone and delete the repository immediately, shallow clones are a good option. Shallow clones are the fastest way to get a copy of the working directory at the tip commit. If you need the commit history for your build, then a treeless partial clone might work better for you than a full clone. Bear in mind that in larger repositories such as the torvalds/linux repository, it will save time on the client but it’s a bit heavier on your git server when compared to a full clone.

Blobless partial clones are particularly effective if you are using Git’s sparse-checkout feature to reduce the size of your working directory. The combination greatly reduces the number of blobs you need to do your work. Sparse-checkout does not reduce the required data transfer for shallow clones.

Notably, we did not test repositories that are significantly larger than the torvalds/linux repository. Such repositories are not really available in the open, but are becoming increasingly common for private repositories. If you feel your repository is not represented by our test repositories, then we recommend trying to replicate our experiments yourself.

As can be observed, our test process is not simulating a real life situation where users have different workflows and work on different branches. Also the set of Git commands that have been analyzed in this study is a small set and is not a representative of a user’s daily Git usage. We are continuing to study these options to get a holistic view of how they change the user experience.

Acknowledgements

A special thanks to Derrick Stolee and our Professional Services team for their efforts and sponsorship of this study!

Tags:

Related posts