The GitHub Blog · March 15
Highlights from Git 2.49

Git 2.49 has been released with a number of new features and bug fixes, including faster packing, backfilling historical blobs in partial clones, support for building against zlib-ng for better performance, the introduction of Rust code, and other related improvements.

Git 2.49 introduces a new name-hash function that takes more of the directory structure into account, improving both packing performance and the size of the resulting packs.

A new git backfill tool can fill in missing historical blobs in a partial clone in a small number of batches.

Git 2.49 can be built against zlib-ng; early experiments show a roughly 25% speedup when printing the contents of every object in a Git repository.

This release introduces Rust code into Git for the first time, in the form of two Rust crates, as part of Git's move toward more library-oriented code.

The release also continues the effort to move away from global variables and resolves a number of compiler warnings.

The open source Git project just released Git 2.49 with features and bug fixes from over 89 contributors, 24 of them new. We last caught up with you on the latest in Git back when 2.48 was released.

To celebrate this most recent release, here is GitHub’s look at some of the most interesting features and changes introduced since last time.

Faster packing with name-hash v2

Many times over this series of blog posts, we have talked about Git’s object storage model, where objects can be written individually (known as “loose” objects), or grouped together in packfiles. Git uses packfiles in a wide variety of functions, including local storage (when you repack or GC your repository), as well as when sending data to or from another Git repository (like fetching, cloning, or pushing).

Storing objects together in packfiles has a couple of benefits over storing them individually as loose. One obvious benefit is that object lookups can be performed much more quickly in pack storage. When looking up a loose object, Git has to make multiple system calls to find the object you’re looking for, open it, read it, and close it. These system calls can be made faster using the operating system’s block cache, but because objects are looked up by a SHA-1 (or SHA-256) of their contents, this pseudo-random access isn’t very cache-efficient.

But most interesting to our discussion is that since loose objects are stored individually, we can only compress their contents in isolation, and can’t store objects as deltas of other similar objects that already exist in your repository. For example, say you’re making a series of small changes to a large blob in your repository. When those objects are initially written, they are each stored individually and zlib compressed. But if the majority of the file’s content remains unchanged among edit pairs, Git can further compress these objects by storing successive versions as deltas of earlier ones. Roughly speaking, this allows Git to store the changes made to an object (relative to some other object) instead of multiple copies of nearly identical blobs.
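If you're curious which objects in one of your own packs were stored as deltas, git verify-pack can show you. Here is a rough sketch of its output, with object ids replaced by placeholders; the columns are object id, type, size, size in the pack, offset, and, for delta'd objects, the chain depth and the base object's id:

$ git verify-pack -v .git/objects/pack/pack-<hash>.idx
<oid-1> blob 1365  438  12
<oid-2> blob   42   39 450 1 <oid-1>
[...]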

But how does Git figure out which pairs of objects are good candidates to store as delta-base pairs? One useful proxy is to compare objects that appear at similar paths. Git does this today by computing what it calls a “name hash”, which is effectively a sortable numeric hash that weights more heavily towards the final 16 non-whitespace characters in a filepath (source). This function comes from Linus all the way back in 2006, and excels at grouping files with similar extensions (all ending in .c, .h, etc.), or files that were moved from one directory to another (a/foo.txt to b/foo.txt).

But the existing name-hash implementation can lead to poor compression when there are many files that have the same basename but very different contents, like having many CHANGELOG.md files for different subsystems stored together in your repository. Git 2.49 introduces a new variant of the hash function that takes more of the directory structure into account when computing its hash. Among other changes, each layer of the directory hierarchy gets its own hash, which is downshifted and then XORed into the overall hash. This creates a hash function which is more sensitive to the whole path, not just the final 16 characters.

This can lead to significant improvements not only in packing performance, but also in the resulting pack’s overall size. For instance, the new hash function reduced the time it took to repack microsoft/fluentui from ~96 seconds to ~34 seconds, while slimming the resulting pack from 439 MiB down to just 160 MiB (source).

While this feature isn’t (yet) compatible with Git’s reachability bitmaps feature, you can try it out for yourself using either git repack’s or git pack-objects’s new --name-hash-version flag via the latest release.
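For example, a minimal invocation on one of your own repositories might look like the following (here -a, -d, and -f ask git repack to rewrite everything into a fresh pack with recomputed deltas, so the new heuristic is applied to every object):

$ git repack -adf --name-hash-version=2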

[source]

Backfill historical blobs in partial clones

Have you ever been working in a partial clone and gotten this unfriendly output?

$ git blame README.md
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (1/1), 1.64 KiB | 8.10 MiB/s, done.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (1/1), 1.64 KiB | 7.30 MiB/s, done.
[...]

What happened here? To understand the answer to that question, let’s work through an example scenario:

Suppose that you are working in a partial clone that you cloned with --filter=blob:none. In this case, your repository has all of its tree, commit, and annotated tag objects, but only the set of blobs which are immediately reachable from HEAD. Put otherwise, your local clone only has the set of blobs it needs to populate a full checkout at the latest revision, and loading any historical blobs will fault in the missing objects from wherever you cloned your repository.
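For reference, a blobless partial clone like the one described above is created as follows (the URL is just an example):

$ git clone --filter=blob:none https://github.com/git/git.git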

In the blame example above, we asked for a blame of the file at path README.md. In order to construct that blame, however, Git needs to see every historical version of the file, computing a diff at each layer to figure out whether or not a given revision modified a given line. But here we see Git loading each historical version of the object one by one, leading to bloated storage and poor performance.

Git 2.49 introduces a new tool, git backfill, which can fault in any missing historical blobs from a --filter=blob:none clone in a small number of batches. These requests use the new path-walk API (also introduced in Git 2.49) to group together objects that appear at the same path, resulting in much better delta compression in the packfile(s) sent back from the server. Since these requests are sent in batches instead of one-by-one, we can easily backfill all missing blobs in only a few packs instead of one pack per blob.

After running git backfill in the above example, our experience looks more like:

$ git clone --sparse --filter=blob:none git@github.com:git/git.git
[...] # downloads historical commits/trees/tags
$ cd git
$ git sparse-checkout add builtin
[...] # downloads current contents of builtin/
$ git backfill --sparse
[...] # backfills historical contents of builtin/
$ git blame -- builtin/backfill.c
85127bcdeab (Derrick Stolee 2025-02-03 17:11:07 +0000   1) /* We need this macro to access core_apply_sparse_checkout */
85127bcdeab (Derrick Stolee 2025-02-03 17:11:07 +0000   2) #define USE_THE_REPOSITORY_VARIABLE
85127bcdeab (Derrick Stolee 2025-02-03 17:11:07 +0000   3)
[...]

But running git backfill immediately after cloning a repository with --filter=blob:none doesn’t bring much benefit, since it would have been more convenient to simply clone the repository without an object filter in the first place. That’s where the backfill command’s --sparse option comes in (it’s the default whenever the sparse checkout feature is enabled in your repository): Git will only download blobs that appear within your sparse checkout, avoiding objects that you wouldn’t check out anyway.

To try it out, run git backfill in any --filter=blob:none clone of a repository using Git 2.49 today!

[source, source]


The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.49, or any previous version in the Git repository.


    It’s true. It takes humans about 100-150 milliseconds to blink their eyes, and setting help.autocorrect to “1” will run the suggested command after waiting only 100 milliseconds (1 decisecond). 
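For context, help.autocorrect treats a positive value as the number of deciseconds to wait before running its guess. A quick way to see it in action (the mistyped command is just an example, and the exact wording of the warning may differ between versions):

$ git config help.autocorrect 1
$ git stauts
WARNING: You called a Git command named 'stauts', which does not exist.
Continuing in 0.1 seconds, assuming that you meant 'status'.
[...]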

