To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.
Before we get into the details of Git 2.37.0, we first wanted to let you know that Git Merge is returning this September. The conference features talks, workshops, and more all about Git and the Git ecosystem. There is still time to submit a proposal to speak. We look forward to seeing you there!
In Git, we often talk about classifying objects as either “reachable” or “unreachable”. An object is “reachable” when there is at least one reference (a branch or a tag) from which you can start an object walk (traversing from commits to their parents, from trees into their sub-trees, and so on) and end up at your destination. Similarly, an object is “unreachable” when no such reference exists.
A Git repository needs all of its reachable objects to ensure that the repository is intact. But it is free to discard unreachable objects at any time. And it is often desirable to do just that, particularly when many unreachable objects have piled up, you’re running low on disk space, or similar. In fact, Git does this automatically when running garbage collection.
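To see the distinction for yourself, here is a minimal sketch you can run in a scratch directory. It assumes a throwaway repository created with mktemp, and the file name and blob contents are made up for illustration. It writes a blob that no reference points at, asks git fsck to list unreachable objects, and then prunes with no grace period.

```shell
set -e
repo=$(mktemp -d)   # throwaway repository (hypothetical location)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

# A reachable blob: HEAD -> commit -> tree -> file.txt
echo "tracked" > file.txt
git add file.txt
git commit -q -m "initial commit"

# An unreachable blob: written to the object store, but nothing refers to it
orphan=$(echo "nobody points at me" | git hash-object -w --stdin)

# fsck prints lines of the form "unreachable <type> <object id>"
unreachable=$(git fsck --unreachable --no-reflogs 2>/dev/null |
  awk '$1 == "unreachable" { print $3 }')
echo "unreachable: $unreachable"

# With no grace period, pruning discards the unreachable blob immediately
git prune --expire=now
```

After the prune, the blob behind file.txt is still there, while the orphaned blob is gone.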
But observant readers will notice the gc.pruneExpire configuration. This setting defines a “grace period” during which unreachable objects that are not yet old enough to be removed from the repository are left alone. This mitigates a race condition in which an unreachable object that is about to be deleted becomes reachable through some other process (like an incoming reference update or a push) just before it is deleted, leaving the repository in a corrupt state.
Setting a small, non-zero grace period makes it much less likely to encounter this race in practice. But it leads us to another problem: how do we keep track of the age of the unreachable objects which didn’t leave the repository? We can’t pack them together into a single packfile; since all objects in a pack share the same modification time, updating any object drags them all forward. Instead, prior to Git 2.37, each surviving unreachable object was written out as a loose object, and the mtime of the individual objects was used to store their age. This can lead to serious problems when there are many unreachable objects which are too new and can’t be pruned.
Git 2.37 introduces a new concept, cruft packs, which allow unreachable objects to be stored together in a single packfile by writing the ages of individual objects in an auxiliary table stored in an *.mtimes file alongside the pack.
While cruft packs don’t eliminate the data race we described earlier, in practice they can help make it much less likely by allowing repositories to prune with a much longer grace period, without worrying about the potential to create many loose objects. To try it out yourself, you can run:
$ git gc --cruft --prune=1.day.ago
and notice that your $GIT_DIR/objects/pack directory will have an additional .mtimes file, storing the ages of each unreachable object written within the last 24 hours:
$ ls -1 .git/objects/pack
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.idx
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.mtimes
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.pack
pack-5a827af6f1a793a45c816b05d40dfd4d5f5edf28.idx
pack-5a827af6f1a793a45c816b05d40dfd4d5f5edf28.pack
There’s a lot of detail we haven’t yet covered on cruft packs, so expect a more comprehensive technical overview in a separate blog post soon.
As we have discussed often before, one of the factors that significantly impact Git’s performance is the size of your working directory. When you run git status, for example, Git has to crawl your entire working directory (in the worst case) in order to figure out which files have been modified.
Git has its own cached understanding of the filesystem to avoid this whole-directory traversal in many cases. But it can be expensive for Git to update its cached understanding of the filesystem with the actual state of the disk while you work.
In the past, Git has made it possible to integrate with tools like Watchman via a hook, making it possible to replace Git’s expensive refreshing process with a long-running daemon which tracks the filesystem state more directly.
But setting up this hook and installing a third-party tool can be cumbersome. In Git 2.37, this functionality is built into Git itself on Windows and macOS, removing the need to install an external tool and configure the hook.
You can enable this for your repository by turning on the core.fsmonitor config setting.
$ git config core.fsmonitor true
After setting up the config, an initial git status will take the normal amount of time, but subsequent commands will take advantage of the monitored data and run significantly faster.
The full implementation is impossible to describe completely in this post. Interested readers can follow along later this week with a blog post written by Jeff Hostetler for more information. We’ll be sure to add a link here when that post is published.
We previously announced Git’s sparse index feature, which helps speed up Git commands when using the sparse-checkout feature in a large repository.
In case you haven’t seen our earlier post, here’s a brief refresher. Often when working in an extremely large repository, you don’t need the entire contents of your repository present locally in order to contribute. For example, if your company uses a single monorepo, you may only be interested in the parts of that repository that correspond to the handful of products you work on.
Partial clones make it possible for Git to only download the objects that you care about. The sparse index is an equally important component of the equation. The sparse index makes it possible for the index (a key data structure which tracks the content of your next commit, which files have been modified, and more) to only keep track of the parts of your repository that you’re interested in.
When we originally announced the sparse index, we explained how different Git subcommands would have to be updated individually to take advantage of the sparse index. With Git 2.37.0, all of those integrations are now included in the core Git project and available to all users.
In this release, the final integrations were for git sparse-checkout and git stash. In particular, git stash has the largest performance boost of all of the integrations so far because of how the command reads and writes indexes multiple times in a single process, achieving a near 80% speed-up in certain cases (though see this thread for all of the details).
Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.
- Speaking of sparse checkouts, this release deprecates the non-cone mode style of sparse checkout definitions.
For the uninitiated, the git sparse-checkout command supports two kinds of patterns which dictate which parts of your repository should be checked out: “cone” mode, and “non-cone” mode. The latter, which allows specifying individual files with a .gitignore-style syntax, can be confusing to use correctly, and has performance problems (namely that in the worst case every pattern must be matched against every file, leading to slow-downs). Most importantly, it is incompatible with the sparse index, which brings the performance enhancements of using a sparse checkout to all of the Git commands you’re familiar with.
For these reasons (and more!), the non-cone mode style of patterns is discouraged, and users are instead encouraged to use cone mode.
In our highlights from the last Git release, we talked about more flexible fsync configuration, which made it possible to more precisely define what files Git would explicitly synchronize with fsync() and what strategy it would use to do that synchronization.
This release brings a new strategy to the list supported by core.fsyncMethod: “batch”, which can provide significant speed-ups on supported filesystems when writing many individual files. This new mode works by staging many updates to the disk’s writeback cache before performing a single fsync(), causing the disk to flush its writeback cache. Files are then atomically moved into place, guaranteeing that they are fsync()-durable by the time they enter the object directory.
For now, this mode only supports batching loose object writes, and will only be enabled when core.fsync includes the loose-object value. On a synthetic test of adding 500 files to the repository with git add (each resulting in a new loose object), the new batch mode imposes only a modest penalty over not fsyncing at all.
On Linux, for example, adding 500 files takes 0.06 seconds without any fsync() calls, 1.88 seconds with an fsync() after each loose object write, and only 0.15 seconds with the new batched fsync(). Other platforms display similar speed-ups, with a notable example being Windows, where the numbers are 0.35 seconds, 11.18 seconds, and just 0.41 seconds, respectively.
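If you want to try batch mode, a configuration along these lines should be a reasonable starting point. This is a sketch under a couple of assumptions: it uses a throwaway repository, and it relies on loose-object being the component name that core.fsync accepts for loose objects (check git help config for your version).

```shell
set -e
repo=$(mktemp -d)   # throwaway repository (hypothetical location)
git init -q "$repo"

# Ask Git to harden loose objects (in addition to the platform defaults)
git -C "$repo" config core.fsync loose-object

# ...and to do so with the batched strategy
git -C "$repo" config core.fsyncMethod batch

# Read the values back to confirm what is configured
fsync_components=$(git -C "$repo" config --get core.fsync)
fsync_method=$(git -C "$repo" config --get core.fsyncMethod)
echo "core.fsync=$fsync_components core.fsyncMethod=$fsync_method"
```

In a real repository you would run the two git config commands directly, without the -C option.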
If you’ve ever wondered, “what’s changed in my repository since yesterday?”, one way you can figure that out is with the --since option, which is supported by all standard revision-walking commands, like git log.
This option works by starting with the specified commits, and walking recursively along each commit’s parents, stopping the traversal as soon as it encounters a commit older than the --since date. But in occasional circumstances (particularly when there is clock skew), this can produce confusing results.
For example, suppose you have three commits, C1, C2, and C3, where C2 is the parent of C3, and C1 is the parent of C2. If both C1 and C3 were written in the last hour, but C2 is a day old (perhaps because the committer’s clock is running slow), then a traversal with --since=1.hour.ago will only show C3, since seeing C2 causes Git to halt its traversal.
If you expect your repository’s history has some amount of clock skew, then you can use --since-as-filter in place of --since, which only prints commits newer than the specified date, but does not halt its traversal upon seeing an older one.
Even in a simple example, trying to remember what object filter was used to clone your repository requires this incantation:
$ git config remote.origin.partialCloneFilter
In Git 2.37, you can now access this information much more readily with the -v flag of git remote, like so:
$ git remote -v
origin [email protected]:git/git.git (fetch) [tree:0]
origin [email protected]:git/git.git (push)
Here, you can easily see between the square brackets that the remote origin was cloned with the tree:0 filter.
Speaking of remote configuration, Git 2.37 ships with support for warning or exiting when it encounters plain-text credentials stored in your configuration with the new transfer.credentialsInUrl setting.
Storing credentials in plain-text in your repository’s configuration is discouraged, since it forces you to ensure you have appropriately restrictive permissions on the configuration file. Aside from storing the data unencrypted at rest, Git often passes the full URL (including credentials) to other programs, exposing them on systems where other processes can read the argument lists of running processes. In most cases, it is encouraged to use Git’s credential mechanism, or tools like GCM.
This new setting allows Git to either warn or halt execution when it sees one of these credentials by setting transfer.credentialsInUrl to “warn” or “die”, respectively. The default, “allow”, does nothing.
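Before flipping the setting, it can be handy to check whether any of your remotes already embed credentials. The following sketch does that with a simple pattern match and then opts into warnings. The remote URL, username, and password are entirely made up, and the regular expression is our own rough approximation of "scheme://user:password@host", not the check Git itself performs.

```shell
set -e
repo=$(mktemp -d)   # throwaway repository (hypothetical location)
git init -q "$repo"

# Hypothetical remote whose URL embeds a username and password
git -C "$repo" remote add origin "https://user:secret@example.com/repo.git"

# List configured remote URLs that look like scheme://user:password@host
offenders=$(git -C "$repo" config --get-regexp '^remote\..*\.url$' |
  grep -E '://[^/@:]+:[^/@]+@' || true)
echo "$offenders"

# From now on, ask Git to warn when it sees credentials like these
git -C "$repo" config transfer.credentialsInUrl warn
```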
If you’ve ever used git add -p to stage the contents of your working tree incrementally, then you may be familiar with git add’s “interactive mode”, or git add -i, of which git add -p is a sub-mode.
In addition to “patch” mode, git add -i supports “status”, “update”, “revert”, “add untracked”, and “diff”. Until recently, git add -i was actually written in Perl. This command has been the most recent subject of a long-running effort to port Git commands written in Perl into C. This makes it possible to use Git’s libraries without spawning sub-processes, which can be prohibitively expensive on certain platforms.
The C reimplementation of git add -i has shipped in releases of Git as early as v2.25.0. In more recent versions, this reimplementation has been in “testing” mode behind an opt-in configuration. Git 2.37 uses the C reimplementation by default, so Windows users should notice a speed-up when using git add -p.
Last but not least, there is a lot of exciting work going on for Git developers, too, like improving the localization workflow, improving CI output with GitHub Actions, and reducing memory leaks in internal APIs.
If you’re interested in contributing to Git, now is a more exciting time than ever to start. Check out this guide for some tips on getting started.