fsck: snapshot default refs before object walk #2026

newren · 2025-12-29T18:14:36Z

Fsck has a race when operating on live repositories; consider the following simple script that writes new commits as fsck runs: #!/bin/bash git fsck & PID=$! while ps -p $PID >/dev/null; do sleep 3 git commit -q --allow-empty -m "Another commit" done Since fsck reads refs at the beginning, walks those for connectivity, and then reads the refs again at the end to check, this can cause fsck to get confused and think that the new refs refer to missing commits and that new reflog entries are invalid. Running the above script in a clone of git.git results in the following (output ellipsized to remove additional errors of the same type): $ ./fsck-while-writing.sh Checking ref database: 100% (1/1), done. Checking object directories: 100% (256/256), done. warning in tag d6602ec: missingTaggerEntry: invalid format - expected 'tagger' line Checking objects: 100% (835091/835091), done. error: HEAD: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310 error: HEAD: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310 error: HEAD: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68 error: HEAD: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68 [...] error: HEAD: invalid reflog entry 87c8a5c2f6b79d9afa9e941590b9a097b6f7ac09 error: HEAD: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a error: HEAD: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a error: HEAD: invalid reflog entry 6724f2dfede88bfa9445a333e06e78536c0c6c0d error: refs/heads/mybranch invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310 error: refs/heads/mybranch: invalid reflog entry 2aac9f9286e2164fbf8e4f1d1df53044ace2b310 error: refs/heads/mybranch: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68 error: refs/heads/mybranch: invalid reflog entry da0f5b80d61844a6f0ad2ddfd57e4fdfa246ea68 [...] error: refs/heads/mybranch: invalid reflog entry 87c8a5c2f6b79d9afa9e941590b9a097b6f7ac09 error: refs/heads/mybranch: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a error: refs/heads/mybranch: invalid reflog entry d80887a48865e6ad165274b152cbbbed29f8a55a error: refs/heads/mybranch: invalid reflog entry 6724f2dfede88bfa9445a333e06e78536c0c6c0d Checking connectivity: 833846, done. missing commit 6724f2dfede88bfa9445a333e06e78536c0c6c0d Verifying commits in commit graph: 100% (242243/242243), done. This problem doesn't occur when refs are specified on the command line for us to check, since we use those specified refs for both walking and checking. Using the same refs for walking and checking seems to just make sense, so modify the existing code to do the same when refs aren't specified. Snapshot the refs at the beginning, and also ignore all reflog entries since the time of our snapshot (while this technically means we could ignore a reflog entry created before the fsck process if the local clock is weird, since reflogs are local-only there are not concerns about differences between clocks on different machines). This combination of changes modifies the output of running the above script to: $ ./fsck-while-writing.sh Checking ref database: 100% (1/1), done. Checking object directories: 100% (256/256), done. warning in tag d6602ec: missingTaggerEntry: invalid format - expected 'tagger' line Checking objects: 100% (835091/835091), done. Checking connectivity: 833846, done. Verifying commits in commit graph: 100% (242243/242243), done. While worries about live updates while running fsck is likely of most interest for forge operators, it will likely also benefit those with automated jobs (such as git maintenance) or even casual users who want to do other work in their clone while fsck is running. Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com> Co-authored-by: Elijah Newren <newren@gmail.com> [en: several changes: * adjusted for upstream refactorings to refs callback call signatures * handle reflogs as well * free recorded snapshot of refs when done * default to snapshotting instead of making it a non-default option * provide reproducible testcase in commit message and rewrite commit message around it ] Signed-off-by: Elijah Newren <newren@gmail.com>

newren · 2025-12-29T19:11:25Z

/submit

gitgitgadget · 2025-12-29T19:12:43Z

Submitted as pull.2026.git.1767035549378.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2026/newren/fsck-snapshot-v1

To fetch this version to local tag pr-2026/newren/fsck-snapshot-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2026/newren/fsck-snapshot-v1

gitgitgadget · 2025-12-30T02:05:00Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This problem doesn't occur when refs are specified on the command line
> for us to check, since we use those specified refs for both walking and
> checking.  Using the same refs for walking and checking seems to just
> make sense, so modify the existing code to do the same when refs aren't
> specified.

Excellent analysis and good approach.

> Snapshot the refs at the beginning, and also ignore all
> reflog entries since the time of our snapshot (while this technically
> means we could ignore a reflog entry created before the fsck process
> if the local clock is weird, since reflogs are local-only there are not
> concerns about differences between clocks on different machines).

Repository on a network filesystem being accessed by hosts with
broken clock?

I do not think our reflog API has (1) give me some token to mark
your current state (2) here is the token you gave me earlier, now
iterate and yield entries but ignore entries added after you gave me
that token, so going by the reflog timestamp is probably the best we
could do.  Any approach may get confused when the user tries to be
cute and issues "reflog delete" or "reflog expire" in the middle
anyway, I suspect ;-)

> While worries about live updates while running fsck is likely of most
> interest for forge operators, it will likely also benefit those with
> automated jobs (such as git maintenance) or even casual users who want
> to do other work in their clone while fsck is running.

Great.  Will queue.  Thanks.

> @@ -509,6 +510,9 @@ static int fsck_handle_reflog_ent(const char *refname,
>  				  timestamp_t timestamp, int tz UNUSED,
>  				  const char *message UNUSED, void *cb_data UNUSED)
>  {
> +	if (now && timestamp > now)
> +		return 0;
> +
>  	if (verbose)
>  		fprintf_ln(stderr, _("Checking reflog %s->%s"),
>  			   oid_to_hex(ooid), oid_to_hex(noid));
> @@ -567,14 +571,53 @@ static int fsck_head_link(const char *head_ref_name,
>  			  const char **head_points_at,
>  			  struct object_id *head_oid);
>  
> -static void get_default_heads(void)
> +struct ref_snapshot {
> +	size_t nr;
> +	size_t name_alloc;
> +	size_t oid_alloc;
> +	char **refname;
> +	struct object_id *oid;
> +};

This data structure is somewhat unexpected.  Instead of a struct
that holds two arrays, I would have rather expected an array of
"struct { refname, oid }", with the possiblity to add a "token to
mark the latest reflog entry" to the mix I alluded to earlier when
such an API function materializes.

[Footnote]

We could call refs_for_each_reflog_ent_reverse(), grab the
parameters that each_reflog_ent_fn receives as that "token" for the
latest reflog entry and stop.  That way, we will learn the value of
<old,new,committer,timestamp,tz,msg>, which should be a robust
enough unique key.

After that when iterating over the reflog, we know we should stop
after processing the reflog entry that holds the recorded value.

gitgitgadget · 2026-01-02T05:53:29Z

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Dec 29, 2025 at 07:12:29PM +0000, Elijah Newren via GitGitGadget wrote:

> Fsck has a race when operating on live repositories; consider the
> following simple script that writes new commits as fsck runs:
> 
>     #!/bin/bash
>     git fsck &
>     PID=$!
> 
>     while ps -p $PID >/dev/null; do
>         sleep 3
>         git commit -q --allow-empty -m "Another commit"
>     done
> 
> Since fsck reads refs at the beginning, walks those for connectivity,
> and then reads the refs again at the end to check, this can cause fsck
> to get confused and think that the new refs refer to missing commits and
> that new reflog entries are invalid.

I'm not sure if this is entirely accurate. Does fsck read refs at the
beginning? I think it walks over everything in the object database
without regard to connectivity, marking each with the HAS_OBJ flag. And
then we use the refs to find the reachable objects, making sure any that
are reachable also have HAS_OBJ set (otherwise they are missing).

So the race is really:

  1. We look at all of the packs and loose objects to set HAS_OBJ.

  2. Somebody else simultaneously adds a new object, which is missed by
     our step 1, and updates a ref to point to it.

  3. We look at the refs for reachability, see the new object, but
     think it is missing because we never saw it in step 1.

And your fix is to snapshot the refs as a "step 0", and use that
snapshot in step 3. So any new objects that are introduced after step 1
will never be referenced, since we are using the snapshot values.

Which makes sense as long as we assume objects are only added to the
repository. I think we'd now have the opposite direction race:

  0. We snapshot the refs.

  1. Somebody else deletes a ref, and then does a pruning git-gc which
     deletes the object it pointed to.

  2. We look at all of the objects and mark them as HAS_OBJ. We do not
     include the now-deleted object.

  3. We do a connectivity check with the snapshot, and are dismayed to
     find that the deleted object (which we believe is still referenced)
     has gone away.

I think you could argue that this is a much more preferable race,
though. A busy server will see lots of new objects introduced and refs
updated, and you do not want to have a stop-the-world lock that prevents
pushes. But it is much less common to do a pruning gc, and it is
probably OK to have a mutually exclusive lock between fsck and gc.

> This problem doesn't occur when refs are specified on the command line
> for us to check, since we use those specified refs for both walking and
> checking.  Using the same refs for walking and checking seems to just
> make sense, so modify the existing code to do the same when refs aren't
> specified.

So I don't think this part is quite right either, then. We're not using
the command-line arguments collect the set of objects in the repo. That
still happens by walking over the odb itself. So if I do:

  git fsck --no-dangling HEAD

in git.git, and while it is running, do this in another terminal:

  git commit --allow-empty -m foo

then I get:

  error: 49a90c19d0cd010ed00fbb1e4256cbefaa8b83e2: object missing

even with your patch. So I think that names given on the command-line
could benefit from this type of snapshot, because they suffer from the
same race. You want to lock in the ref resolution (whether from
iterating or from names on the command-line) before you start walking
over the odb.

  Side note: I do not think I have ever run fsck with refs on the
  command-line. It is not like it saves you any time! Most of the
  expense comes from opening up and verifying the objects in the first
  step, not from looking at ref reachability.

And one final note on the overall direction of the patch. We are
assuming that if we look at the refs first and then the odb second, that
we will be getting a "fresh" view of the odb in that second step. But
that isn't necessarily so, as we might have loaded the set of packs
earlier in the process. I don't know if it is possible to trigger that
during fsck or not, but certainly it is relying on a subtle assumption.
It probably is worth calling odb_reprepare() after taking the snapshot
to ensure we are not getting any results cached from before the snapshot
was taken.

> +struct ref_snapshot {
> +	size_t nr;
> +	size_t name_alloc;
> +	size_t oid_alloc;
> +	char **refname;
> +	struct object_id *oid;
> +};

Minor nit, but: why keep two arrays and not a single struct with both?
After all, you even end up sticking them back in a struct at the only
point of use:

> +	if (the_refs)
> +		for (size_t i = 0; i < the_refs->nr; i++) {
> +			struct reference ref = {
> +				.name = the_refs->refname[i],
> +				.oid = &the_refs->oid[i],
> +			};
> +			fsck_handle_ref(&ref, NULL);
> +		}

So this could really just be an array of "struct reference". You can't
just hold onto the "struct reference" passed in to the snapshot_refs()
callback (because it gets reused as we iterate), but you could do a deep
copy.

That did make me wonder a bit about the other fields in "struct
reference" (which your snapshot just throws away). But it looks like
fsck_handle_ref() only cares about the name and oid, so it is OK.

> @@ -999,6 +1050,19 @@ int cmd_fsck(int argc,
>  	if (check_references)
>  		fsck_refs(the_repository);
>  
> +	/*
> +	 * Take a snapshot of the refs before walking objects to avoid looking
> +	 * at a set of refs that may be changed by the user while we are walking
> +	 * objects. We can still walk over new objects that are added during the
> +	 * execution of fsck but won't miss any objects that were reachable.
> +	 */
> +	use_snapshot = !argc;
> +	if (use_snapshot) {
> +		now = time(NULL);
> +		refs_for_each_rawref(get_main_ref_store(the_repository),
> +				     snapshot_refs, &default_refs_snapshot);
> +	}

BTW, one of the reasons I started looking at this is that Coverity
complained about this segment of code. We set use_snapshot if and only
if we don't have any argc arguments. And then later...

>  	if (!argc) {
> -		get_default_heads();
> +		get_default_heads(use_snapshot ? &default_refs_snapshot : NULL);
>  		keep_cache_objects = 1;
>  	}

...we enter this block only if argc is zero. So we know that
use_snapshot will be true here, and the NULL path (and thus the fallback
code in get_default_heads()) will never be used.

That's not wrong exactly, as it's "just" dead code. But it was what led
me to thinking about whether the case of non-zero argc would benefit
from the snapshot, too.

There are a few other related interesting cases, too:

  - We may use the index file for connectivity, as well. It suffers from
    the same race, and would benefit from a snapshot.

  - In get_default_heads() we also look at worktree HEADs. Those have
    the same race (their normal refs we don't consider here, because
    they were already handled by the overall ref iteration).

I know that neither of those is of particular interest to you as a bare
server repo would have neither. And it may be OK not to handle them, if
the complexity doesn't merit it. But it might be worth documenting the
short-coming.

-Peff

PS The other reason I looked at your patch is that I got deja vu from
   all of this. I thought we had discussed ref snapshotting for fsck
   before, but I couldn't find anything on the list. It may have been
   internal GitHub discussions.

gitgitgadget · 2026-01-02T05:53:31Z

User Jeff King <peff@peff.net> has been added to the cc: list.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fsck: snapshot default refs before object walk #2026

fsck: snapshot default refs before object walk #2026

newren commented Dec 29, 2025 •

edited by gitgitgadget bot

Loading

Uh oh!

newren commented Dec 29, 2025

Uh oh!

gitgitgadget bot commented Dec 29, 2025

Uh oh!

gitgitgadget bot commented Dec 30, 2025

Uh oh!

gitgitgadget bot commented Jan 2, 2026

Uh oh!

gitgitgadget bot commented Jan 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fsck: snapshot default refs before object walk #2026

Are you sure you want to change the base?

fsck: snapshot default refs before object walk #2026

Conversation

newren commented Dec 29, 2025 • edited by gitgitgadget bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

newren commented Dec 29, 2025

Uh oh!

gitgitgadget bot commented Dec 29, 2025

Uh oh!

gitgitgadget bot commented Dec 30, 2025

Uh oh!

gitgitgadget bot commented Jan 2, 2026

Uh oh!

gitgitgadget bot commented Jan 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

newren commented Dec 29, 2025 •

edited by gitgitgadget bot

Loading