How to update a git shallow clone?

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

Background

(For the tl;dr, see Questions below.)

I have multiple shallow clones of git repositories. I'm using shallow clones because they're a lot smaller than full clones. Each was created with something like git clone --single-branch --depth 1 <git-repo-url> <dir-name>.

This works fine, except I don't see how to update it.

When I clone by a tag, updating is not meaningful, as a tag is a frozen point in time (as I understand it). In this case, if I want to update, it means I want to clone by another tag, so I just rm -rf <dir-name> and clone again.

Things get more complicated when I’ve cloned the HEAD of a master branch then later want to update it.

I tried git pull --depth 1, but although I'm not going to push anything to the remote repository, it complains that it doesn't know who I am.

I tried git fetch --depth 1, but although it seems to update something, I checked and the clone is not up to date (some files on the remote repository have different content than the ones in my clone).

After https://stackoverflow.com/a/20508591/279335, I tried git fetch --depth 1; git reset --hard origin/master, but there are two problems: first, I don't understand why git reset is needed; second, although the files seem to be up to date, some old files remain, and git clean -df does not delete them.

Questions

Given a clone created with git clone --single-branch --depth 1 <git-repo-url> <dir-name>, how do I update it to achieve the same result as rm -rf <dir-name>; git clone --single-branch --depth 1 <git-repo-url> <dir-name>? Or is rm -rf <dir-name> followed by cloning again the only way?

Note

This is not a duplicate of How to update a shallow cloned submodule without increasing main repo size, as the answer there does not fulfil my expectations and I'm using simple repositories, not submodules (which I don't know about).

1 Answer

深蓝 · Answer 1 · 2021-10-16T23:52:12+0000

TL;DR

Given that you have an existing --depth 1 repository cloned from branch B and you'd like Git to act as if you removed and re-cloned, you can use this sequence of commands:

git fetch --depth 1
git reset --hard origin/B
git clean -dfx

(Replace B with your branch name, e.g., git reset --hard origin/master.) You should be able to do the git clean step at any point before or after the other two commands, but the git reset must come after the git fetch.
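The sequence above can be tried safely on a throwaway repository. The following is a minimal sketch, assuming a local "upstream" created in a temporary directory; the paths, file names, the demo identity, and the branch name master are all made up for illustration. It updates a depth-1 clone in place and compares the result with a fresh depth-1 clone.

```shell
#!/bin/sh
# Sketch: update a depth-1 clone in place, then compare with a fresh clone.
# Every path, file name and the branch name "master" is a demo assumption.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

# Throwaway "upstream" repository with one commit on master.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master
echo one > kept.txt
echo old > removed.txt
git add .
git commit -q -m "first"

# Shallow clone (a file:// URL is used so that --depth is honored locally).
git clone -q --depth 1 "file://$work/upstream" "$work/clone"

# Upstream moves on: one file changes, one file is deleted.
echo two > kept.txt
git rm -q removed.txt
git commit -q -a -m "second"

# Leave some junk in the clone, then update it in place.
cd "$work/clone"
echo junk > stray.tmp
git fetch -q --depth 1
git reset -q --hard origin/master
git clean -q -dfx

# A fresh shallow clone for comparison.
git clone -q --depth 1 "file://$work/upstream" "$work/fresh"
updated_tree=$(git -C "$work/clone" rev-parse 'HEAD^{tree}')
fresh_tree=$(git -C "$work/fresh" rev-parse 'HEAD^{tree}')
echo "updated: $updated_tree"
echo "fresh:   $fresh_tree"
```

If the two tree IDs match and the stray file is gone, the in-place update has produced the same checked-out content as a re-clone, at least for this small case.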

Long

[slightly reworded and formatted] Given a clone created with git clone --single-branch --depth 1 url directory, how can I update it to achieve the same result as rm -rf directory; git clone --single-branch --depth 1 url directory?

Note that --single-branch is the default when using --depth 1. The (single) branch is the one you give with -b. There's a long aside that goes here about using -b with tags but I will leave that for later. If you don't use -b, your Git asks the "upstream" Git—the Git at url—which branch it has checked-out, and pretends you used -b thatbranch. This means that it is important to be careful when using --single-branch without -b to make sure that this upstream repository's current branch is sensible, and of course, when you do use -b, to make sure that the branch argument you give really does name a branch, not a tag.
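To see the implied --single-branch behavior, one can inspect the fetch refspec that the clone records. This is a small sketch on a throwaway local upstream; the paths, the demo identity, and the branch names master and feature are assumptions for the demo only.

```shell
#!/bin/sh
# Sketch: --depth 1 without -b tracks only the branch upstream has checked out.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master
echo hello > file.txt
git add file.txt
git commit -q -m "first"
git branch feature    # a second branch the shallow clone should not track

# Neither --single-branch nor -b is given here, only --depth 1.
git clone -q --depth 1 "file://$work/upstream" "$work/clone"

# The recorded fetch refspec names just one branch.
refspec=$(git -C "$work/clone" config --get remote.origin.fetch)
echo "$refspec"

if git -C "$work/clone" show-ref --verify --quiet refs/remotes/origin/feature
then tracks_feature=yes
else tracks_feature=no
fi
echo "tracks feature: $tracks_feature"
```

A later git fetch in this clone uses that refspec, which is why only the one branch gets updated.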

The simple answer is basically this one, with two slight changes:

After https://stackoverflow.com/a/20508591/279335, I tried git fetch --depth 1; git reset --hard origin/master, but there are two problems: first, I don't understand why git reset is needed; second, although the files seem to be up to date, some old files remain, and git clean -df does not delete them.

The two slight changes are: make sure you use origin/branchname instead, and add -x (git clean -d -f -x or git clean -dfx) to the git clean step. As for why, that gets a bit more complicated.
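One likely reason the -x matters (an assumption about the asker's leftover files, not something the question confirms) is that git clean skips ignored files unless -x is given. A minimal sketch, with a made-up .gitignore pattern and file names:

```shell
#!/bin/sh
# Sketch: git clean -df leaves ignored files behind; -dfx removes them too.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/repo"
cd "$work/repo"
echo '*.log' > .gitignore
git add .gitignore
git commit -q -m "ignore logs"

echo leftover > build.log    # ignored via .gitignore
echo leftover > notes.txt    # untracked, but not ignored

git clean -q -df
after_df=$(ls)
git clean -q -dfx
after_dfx=$(ls)

echo "after -df:  $after_df"
echo "after -dfx: $after_dfx"
```

In this demo, notes.txt disappears after the first clean, while build.log survives until -x is added.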

What's going on

Without --depth 1, the git fetch step calls up the other Git and gets from it a list of branch names and corresponding commit hash IDs. That is, it finds a list of all the upstream's branches and their current commits. Then, because you have a --single-branch repository, your Git throws out all but the single branch, and brings over everything Git needs to connect that current commit back to the commit(s) you already have in your repository.
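The list of branch names and hash IDs that the other Git offers can be viewed directly with git ls-remote. A small sketch against a throwaway local upstream (paths, identity, and branch names are demo assumptions):

```shell
#!/bin/sh
# Sketch: show the branch names and commit hash IDs an upstream advertises.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master
echo hello > file.txt
git add file.txt
git commit -q -m "first"
git branch feature

# One line per branch: "<hash>	refs/heads/<name>"
heads=$(git ls-remote --heads "file://$work/upstream")
echo "$heads"
```

A single-branch clone sees this same list during fetch but only acts on the entry that matches its refspec.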

With --depth 1, your Git doesn't bother connecting the new commit to older historical commits at all. Instead, it obtains just the one commit and the other Git objects needed to complete that one commit. It then writes an additional "shallow graft" entry to mark that one commit as a new pseudo-root commit.
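The shallow marker can be observed in a real clone. In the sketch below, the upstream repository and its three commits are made up for the demo; it checks that the clone sees a single commit and that .git/shallow records that commit as the cut-off point. The --is-shallow-repository option needs Git 2.15 or later.

```shell
#!/bin/sh
# Sketch: inspect the shallow boundary of a depth-1 clone.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master
for n in 1 2 3; do
    echo "$n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

git clone -q --depth 1 "file://$work/upstream" "$work/clone"
cd "$work/clone"

is_shallow=$(git rev-parse --is-shallow-repository)   # Git >= 2.15
visible=$(git rev-list --count HEAD)                  # commits reachable here
boundary=$(cat .git/shallow)                          # commit(s) treated as parentless
tip=$(git rev-parse HEAD)

echo "shallow: $is_shallow, visible commits: $visible"
echo "boundary: $boundary"
echo "tip:      $tip"
```

Upstream has three commits, yet the clone reports one, and the boundary entry is the tip commit itself.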

Regular (non-shallow) clone and fetch

These are all related to how Git behaves when you're using a normal (non-shallow, non-single-branch) clone: git fetch calls up the upstream Git, gets a list of everything, and then brings over whatever you don't already have. This is why an initial clone is so slow, and a fetch-to-update is usually so fast: once you get a full clone, the updates rarely have very much to bring over: maybe a few commits, maybe a few hundred, and most of those commits don't need much else either.

The history of a repository is formed from the commits. Each commit names its parent commit (or for merges, parent commits, plural), in a chain that goes backwards from "the latest commit", to the previous commit, to some more-ancestral commit, and so on. The chain eventually stops when it reaches a commit that has no parent, such as the first commit ever made in the repository. This kind of commit is a root commit.
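The parent chain and the root commit can be listed with ordinary Git commands. A minimal sketch on a made-up three-commit repository:

```shell
#!/bin/sh
# Sketch: walk the parent chain and find the root commit.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/repo"
cd "$work/repo"
echo 1 > file.txt
git add file.txt
git commit -q -m "commit 1"
first=$(git rev-parse HEAD)      # remember the very first commit
for n in 2 3; do
    echo "$n" > file.txt
    git commit -q -a -m "commit $n"
done

# Each line: abbreviated hash, abbreviated parent hash(es), subject.
git log --format='%h parent=%p %s'

# Commits with no parent at all are the root commits.
roots=$(git rev-list --max-parents=0 HEAD)
echo "root: $roots"
```

The last line of the log output has an empty parent field, and that commit is the one rev-list reports as the root.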

That is, we can draw a graph of commits. In a really simple repository the graph is just a straight line, with all the arrows pointing backwards:

o <- o <- o <- o   <-- master

The name master points to the fourth and latest commit, which points back to the third, which points back to the second, which points back to the first.

Each commit carries with it a complete snapshot of all the files that go in that commit. Files that are not at all changed are shared across these commits: the fourth commit just "borrows" the unchanged version from the third commit, which "borrows" it from the second, and so on. Hence, each commit names all the "Git objects" that it needs, and Git either finds those objects locally—because it already has them—or uses the fetch protocol to bring them over from the other, upstream Git. There's a compression format called "packing", and a special variant for network transfer called "thin packs", that allows Git to do this even better / fancier, but the principle is simple: Git needs all, and only, those objects that go with the new commits it's picking up. Your Git decides whether it has those objects, and if not, obtains them from their Git.
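The sharing of unchanged files shows up in the object IDs: a file that does not change between two commits is recorded under the same blob ID in both. A small sketch with made-up file names:

```shell
#!/bin/sh
# Sketch: an unchanged file maps to the same blob object in consecutive commits.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/repo"
cd "$work/repo"
echo same > unchanged.txt
echo v1 > changed.txt
git add .
git commit -q -m "first"
echo v2 > changed.txt
git commit -q -a -m "second"

old_same=$(git rev-parse HEAD~1:unchanged.txt)
new_same=$(git rev-parse HEAD:unchanged.txt)
old_changed=$(git rev-parse HEAD~1:changed.txt)
new_changed=$(git rev-parse HEAD:changed.txt)

echo "unchanged.txt: $old_same -> $new_same"
echo "changed.txt:   $old_changed -> $new_changed"
```

Only the blob for changed.txt (plus a new tree and commit object) is new in the second commit, which is why an incremental fetch usually has little to transfer.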

A more-complicated, more-complete graph generally has several points where it branches, some where it merges, and multiple branch names pointing to different branch tips:

        o--o   <-- feature/tall
       /
o--o--o---o    <-- master
    \    /
     o--o      <-- bug/short

Here branch bug/short is merged back into master, while branch feature/tall is still undergoing development. The name bug/short can (probably) now be deleted entirely: we don't need it anymore if we are done making commits on it. The commit at the tip of master names two previous commits, including the commit at the tip of bug/short, so by fetching master we will fetch the bug/short commits.

Note that the simple and slightly-more-complicated graphs each have just one root commit. That's pretty typical: all repositories that have commits have at least one root commit, since the very first commit is always a root commit; but most repositories have only one root commit as well. You can, however, have different root commits, as with this graph:

 o--o
     \
o--o--o   <-- master

or this one:

 o--o     <-- orphan

o--o      <-- master

In fact, the one with just the one master was probably made by merging orphan into master, then deleting the name orphan.
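A history with two root commits can be produced as described. The sketch below is one way to do it, assuming Git 2.9 or later for --allow-unrelated-histories; the branch name orphan follows the text, and the file names are made up.

```shell
#!/bin/sh
# Sketch: merge an orphan branch into master, ending up with two root commits.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
echo main > main.txt
git add main.txt
git commit -q -m "master root"

# Start a branch with no history, and empty the index inherited from master.
git checkout -q --orphan orphan
git rm -q -rf .
echo other > other.txt
git add other.txt
git commit -q -m "orphan root"

# Merge the unrelated history into master, then drop the branch name.
git checkout -q master
git merge -q --allow-unrelated-histories -m "merge orphan" orphan
git branch -q -D orphan

root_count=$(git rev-list --count --max-parents=0 HEAD)
echo "root commits reachable from master: $root_count"
```

After the branch name is deleted, both roots are still reachable through the merge commit on master.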

Grafts and replacements

Git has for a long time had (possibly shaky) support for grafts, which was replaced with (much better, actually-solid) support for generic replacements. To grasp them concretely we need to add, to the above, the notion that each commit has its own unique ID. These IDs are the big ugly 40-character SHA-1 hashes, face0ff... and so on. In fact, every Git object has a unique ID, though for graph purposes, all we care about are the commits.

For drawing graphs, those big hash IDs are too painful to use, so we can use one-letter names A through Z instead. Let's use this graph again but put in one-letter names:

        E--H   <-- feature/tall
       /
A--B--D---G    <-- master
    \    /
     C--F      <-- bug/short

Commit H refers back to commit E (E is H's parent). Commit G, which is a merge commit—meaning it has at least two parents—refers back to both D and F, and so on.

Note that the branch names, feature/tall, master, and bug/short, each point to one single commit. The name bug/short points to commit F. This is why commit F is on branch bug/short ... but so is commit C. Commit C is on bug/short because it is reachable from the name. The name gets us to F, and F gets us to C, so C is on branch bug/short.

Note, however, that commit G, the tip of master, gets us to commit F. This means that commit F is also on branch master. This is a key concept in Git: commits may be on one, many, or even no branches. A branch name is merely a way to get started within a commit graph. There are other ways, such as tag names, refs/stash (which gets you to the current stash: each stash is actually a couple of commits), and the reflogs (which are normally hidden from view as they are normally just clutter).
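Reachability can be checked with for-each-ref --contains (Git 2.7 or later). This sketch rebuilds the lettered graph in a throwaway repository and asks which branches contain commit F. The helper function make_commit and the one-file-per-commit layout are inventions for the demo.

```shell
#!/bin/sh
# Sketch: rebuild the A..H graph and list the branches that contain commit F.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

# Hypothetical helper: one new file per commit, so merges never conflict.
make_commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
make_commit A
make_commit B
git checkout -q -b bug/short
make_commit C
make_commit F
git checkout -q master
make_commit D
git checkout -q -b feature/tall
make_commit E
make_commit H
git checkout -q master
git merge -q --no-ff -m G bug/short

f_commit=$(git rev-parse bug/short)
containing=$(git for-each-ref --contains "$f_commit" \
    --format='%(refname:short)' refs/heads)
echo "$containing"
```

The output lists bug/short and master but not feature/tall, since F is reachable from G but not from H.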

This also, however, gets us to grafts and replacements. A graft is just a limited kind of replacement, and shallow repositories use a limited form of graft.¹ I won't describe replacements fully here as they are a bit more complicated, but in general, what Git does for all of these is to use the graft or replacement as an "instead-of". For the specific case of commits, what we want here is to be able to change—or at least, pretend to change—the parent ID or IDs of any commit ... and for shallow repositories, we want to be able to pretend that the commit in question has no parents.

¹The way shallow repositories use the graft code is not shaky. For the more general case, I recommend using git replace instead, as that also was and is not shaky. The only recommended use for grafts is (or at least was, years ago) to put them in place just long enough to run git filter-branch to copy an altered, grafted history, after which you should just discard the grafted history entirely. You can use git replace for this purpose as well, but unlike grafts, you can use git replace permanently or semi-permanently, without needing git filter-branch.
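The "pretend this commit has no parents" idea can be tried with git replace --graft, which records a replacement commit with the parents you list (none, in this case). This is only an illustration of the general mechanism on a made-up three-commit repository; a shallow clone records its cut-off differently, in .git/shallow.

```shell
#!/bin/sh
# Sketch: make the tip commit look like a root commit using a replacement.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
work=$(mktemp -d)

git init -q "$work/repo"
cd "$work/repo"
for n in 1 2 3; do
    echo "$n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

before=$(git rev-list --count HEAD)

# No parents are listed, so the replacement for HEAD has none.
git replace --graft HEAD

after=$(git rev-list --count HEAD)
# The real history is untouched; it is only hidden by the replacement.
real=$(git --no-replace-objects rev-list --count HEAD)

echo "before: $before, with replacement: $after, real: $real"
```

Deleting the replacement (git replace -d with the commit's hash) makes the full history visible again, which is the "instead-of" behavior described above.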

Making a shallow clone

To make a depth-1 shallow clone of the current state of the upstream repository, we will pick one of the three branch names—feature/tall, master, or bug/short—and translate it to a commit ID. Then we will write a special graft entry that says: "When you see that commit, pretend that it has no parent commits, i.e., is a root commit."

Let's say we pick master. The name master points to commit G, so to make a shallow clone of commit G, we obtain commit G from the upstream Git as usual, but then write a special graft entry that claims commit G has no parents. We put that into our repository, and now our graph looks like this:
