Skip to content

feat(gitter): affected commits graph walking#4974

Open
Ly-Joey wants to merge 27 commits intogoogle:masterfrom
Ly-Joey:feat-gitter-affected-commits
Open

feat(gitter): affected commits graph walking#4974
Ly-Joey wants to merge 27 commits intogoogle:masterfrom
Ly-Joey:feat-gitter-affected-commits

Conversation

@Ly-Joey
Copy link
Contributor

@Ly-Joey Ly-Joey commented Mar 5, 2026

Add graph related logic and find affected commits.

Definition of affected commits:

A commit is affected when: from at least one introduced that is an ancestor of the commit, there is no path between them that passes through a fix. A fix can either be a fixed commit, or the children of a lastAffected commit.

Key changes

New endpoint:

  • POST /affected-commits:
    • Accepts a list of events (each containing an eventType (introduced, fixed, lastAffected, or limit) and commit hash)
    • Returns a list of affected commits (hash in bytes and associated refs) in protobuf format

Cherry-pick Detection (optional):
Expand introduced and fixed commits to include their cherry-picked equivalents (determined with PatchID)

There's a follow-up PR (WIP) to refactor gitter (the repetitive fetches and semaphore logic, etc) and add LRU cache for repos.

@Ly-Joey
Copy link
Contributor Author

Ly-Joey commented Mar 9, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant new functionality for determining affected commits by walking the git graph. However, it also introduces several critical security vulnerabilities, primarily related to command injection and SSRF. The most severe issue is in buildCommitGraph, where user-controlled data is concatenated into a shell command, and new endpoints lack the protocol validation present in existing ones, allowing for command execution via git protocols. Additionally, a potential Denial of Service vulnerability due to unvalidated input length during hash parsing was identified. Beyond these security concerns, a critical bug was found in the graph building logic, and there are opportunities to improve code structure and testing practices to enhance maintainability and robustness.

@Ly-Joey Ly-Joey force-pushed the feat-gitter-affected-commits branch from 4227a68 to 97b9e25 Compare March 10, 2026 02:19
@Ly-Joey
Copy link
Contributor Author

Ly-Joey commented Mar 10, 2026

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces a new /affected-commits API endpoint to the gitter service, enabling clients to query for commits affected by specific events (introduced, fixed, last_affected, limit) within a Git repository. This includes implementing graph traversal logic (Affected and Limit methods) with support for cherry-pick detection, defining new Protobuf messages for the API response, and adding URL validation to both the new and existing /cache handlers. The Commit struct was updated to use Refs instead of Tags, and rootCommits were added to the Repository struct. Review comments highlighted several areas for improvement: optimizing regex compilation for URL validation in both cacheHandler and affectedCommitsHandler by compiling it once at the package level; enhancing test robustness and readability in TestAffectedCommitsHandler by using EventType constants instead of hardcoded strings; clarifying the Limit function's assumption about linear history when traversing commits by following only the first parent; and correcting a bug in the decodeSHA1 helper function in repository_test.go where fmt.Sprintf incorrectly pads with spaces instead of zeros, which would cause hex.DecodeString to fail.

@Ly-Joey Ly-Joey marked this pull request as ready for review March 10, 2026 04:59
@Ly-Joey Ly-Joey requested a review from tobyhawker March 10, 2026 22:23
@tobyhawker
Copy link
Contributor

I'm not sure that the definition of "affected" is quite right:

A commit is affected when: from at least one introduced there is no path from the commit through a fix to the introduced

As phrased, I don't think it's made explicit that only commits where the introduced is an ancestor are affected. How does something like this sound (just a suggestion, please feel free to adjust)?

A commit is affected when: from at least one introduced that is an ancestor of the commit, there is no path between them that passes through a fixed.

Copy link
Contributor

@another-rex another-rex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

}
logger.InfoContext(ctx, "Received request: /affected-commits", slog.Any("introduced", introduced), slog.Any("fixed", fixed), slog.Any("last_affected", lastAffected), slog.Any("limit", limit), slog.Bool("cherrypickIntro", cherrypickIntro), slog.Bool("cherrypickFixed", cherrypickFixed))

semaphore <- struct{}{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm originally I think this semaphore was meant to be inside the singleflight.

Comment on lines +615 to +620
repoDirName := getRepoDirName(url)
repoPath := filepath.Join(gitStorePath, repoDirName)

repoLock := GetRepoLock(url)
repoLock.RLock()
defer repoLock.RUnlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I think the repo lock can be attached to the repo itself, so you can't access the underlying repo without locking. Can we do something like this:

Make fetchRepo return a repo struct, which contains:

  • path to repo
  • *Repository (nil by default) and private
    • to the repo mutex in the global map

then Load can be a function on the repo struct, which will do the locking before loading the repository. Just an idea at the moment, you probably can play around with the types a bit to improve it.

if cherrypickIntro {
introduced = r.expandByCherrypick(introduced)
}
if cherrypickFixed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be be cherrypicking last-affected +1 commits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m hesitant to do cherrypicking for lastAffected because it doesn't guarantee a single commit that contains the code fix like a fixed commit does.
Especially if the lastAffected commit is derived from a version tag, then all we can be sure is that a fix commit exists in (current version, next version].
We can always add this logic later if we have more stats that support it.

}
}

keys := slices.Collect(maps.Keys(unique))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to do some sorting for tests to pass? Or is it fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used cmpopts.SortSlices() in the test file to make sure tests pass deterministically.

Copy link
Member

@michaelkedar michaelkedar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good - some initial comments

semaphore chan struct{} // Request concurrency control
)

var validURLRegex = regexp.MustCompile(`^(https?|git|ssh)://`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I know this isn't new, but) do we actually support ssh? I don't think the gitter on GKE has any credentials to use with it

Comment on lines 388 to +389
// Check if url starts with protocols: http(s)://, git://, ssh://, (s)ftp://
if match, _ := regexp.MatchString("^(https?|git|ssh)://", url); !match {
if !validURLRegex.MatchString(url) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the comment says it supports ftp, but the regex does not (+ same on the comments in other endpoints)

Comment on lines +622 to +623
// I can't change singleflight's interface
repoAny, err, _ := gLoad.Do(repoPath, func() (any, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in this PR, but if we're using singleflight heavily, and the any typing is clumsy, I reckon we could make a small wrapper with generics

type Group[T any] struct {
  delegate singleflight.Group
}

func (g *Group[T]) Do(key string, fn func() (T, error)) (T, error, bool) { // or whatever's preferable
  // call delegate.Do with wrapped functions and do the casting here
}

Comment on lines +512 to +513
// Remove from affected list if it was reached via a previous non-fixed path.
delete(affectedFromIntro, unaffected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you not have to check if any of the commits here were introduced commits?

e.g. if it's like

I1 -> F1 -> I2 -> F2

If you do I2 first, then I1 would all the commits after F1 become unaffected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine, since affectedFromIntro is per loop right? Otherwise we'll have problems from fixes and introduced being in parallel as well.

}

// Affected returns a list of commits that are affected by the given introduced, fixed and last_affected events
func (r *Repository) Affected(ctx context.Context, introStrs, fixedStrs, laStrs []string, cherrypickIntro, cherrypickFixed bool) []*Commit {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to move some of these structs/functions to a go/internal/gitter, so that other binaries may reuse the logic if they want to bypass the gitter service (e.g. for a datafix-type thing).

}

// Between walks and returns the commits that are strictly between introduced (inclusive) and limit (exclusive)
func (r *Repository) Limit(ctx context.Context, introStrs, limitStrs []string) []*Commit {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want cherrypick detection here?

}

// Testing cases with introduced and fixed only.
func TestAffected_Introduced_Fixed(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some tests for the edge cases when introduced, fixed and/or last_affected are equal to each other.
(and I suspect the logic may need to be adjusted to account for these)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also tests for introduced = 0

Comment on lines +18 to +25
message AffectedCommit {
bytes hash = 1;
repeated string refs = 2;
}

message AffectedCommitsResponse {
repeated AffectedCommit commits = 1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably want to move refs to the top level response (as a mapping of ref -> hash)

return hashes
}

// Affected returns a list of commits that are affected by the given introduced, fixed and last_affected events
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth restating our definition of "affected" here in the code, where we're more likely to see it (and so help the reader understand why the code does what it does)?

Another point that might make sense to recap is just before the loop over the "introduced" commits.

Copy link
Contributor

@another-rex another-rex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh btw for the bfs and the queue, did you know about: https://pkg.go.dev/container/list

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants