feat(gitter): affected commits graph walking#4974
feat(gitter): affected commits graph walking#4974Ly-Joey wants to merge 27 commits intogoogle:masterfrom
Conversation
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces significant new functionality for determining affected commits by walking the git graph. However, it also introduces several critical security vulnerabilities, primarily related to command injection and SSRF. The most severe issue is in buildCommitGraph, where user-controlled data is concatenated into a shell command, and new endpoints lack the protocol validation present in existing ones, allowing for command execution via git protocols. Additionally, a potential Denial of Service vulnerability due to unvalidated input length during hash parsing was identified. Beyond these security concerns, a critical bug was found in the graph building logic, and there are opportunities to improve code structure and testing practices to enhance maintainability and robustness.
4227a68 to
97b9e25
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
The pull request introduces a new /affected-commits API endpoint to the gitter service, enabling clients to query for commits affected by specific events (introduced, fixed, last_affected, limit) within a Git repository. This includes implementing graph traversal logic (Affected and Limit methods) with support for cherry-pick detection, defining new Protobuf messages for the API response, and adding URL validation to both the new and existing /cache handlers. The Commit struct was updated to use Refs instead of Tags, and rootCommits were added to the Repository struct. Review comments highlighted several areas for improvement: optimizing regex compilation for URL validation in both cacheHandler and affectedCommitsHandler by compiling it once at the package level; enhancing test robustness and readability in TestAffectedCommitsHandler by using EventType constants instead of hardcoded strings; clarifying the Limit function's assumption about linear history when traversing commits by following only the first parent; and correcting a bug in the decodeSHA1 helper function in repository_test.go where fmt.Sprintf incorrectly pads with spaces instead of zeros, which would cause hex.DecodeString to fail.
|
I'm not sure that the definition of "affected" is quite right:
As phrased, I don't think it's made explicit that only commits where the introduced is an ancestor are affected. How does something like this sound (just a suggestion, please feel free to adjust)?
|
| } | ||
| logger.InfoContext(ctx, "Received request: /affected-commits", slog.Any("introduced", introduced), slog.Any("fixed", fixed), slog.Any("last_affected", lastAffected), slog.Any("limit", limit), slog.Bool("cherrypickIntro", cherrypickIntro), slog.Bool("cherrypickFixed", cherrypickFixed)) | ||
|
|
||
| semaphore <- struct{}{} |
There was a problem hiding this comment.
hmm originally I think this semaphore was meant to be inside the singleflight.
| repoDirName := getRepoDirName(url) | ||
| repoPath := filepath.Join(gitStorePath, repoDirName) | ||
|
|
||
| repoLock := GetRepoLock(url) | ||
| repoLock.RLock() | ||
| defer repoLock.RUnlock() |
There was a problem hiding this comment.
Ideally I think the repo lock can be attached to the repo itself, so you can't access the underlying repo without locking. Can we do something like this:
Make fetchRepo return a repo struct, which contains:
- path to repo
- *Repository (nil by default) and private
-
- to the repo mutex in the global map
then Load can be a function on the repo struct, which will do the locking before loading the repository. Just an idea at the moment, you probably can play around with the types a bit to improve it.
| if cherrypickIntro { | ||
| introduced = r.expandByCherrypick(introduced) | ||
| } | ||
| if cherrypickFixed { |
There was a problem hiding this comment.
Should be be cherrypicking last-affected +1 commits?
There was a problem hiding this comment.
I’m hesitant to do cherrypicking for lastAffected because it doesn't guarantee a single commit that contains the code fix like a fixed commit does.
Especially if the lastAffected commit is derived from a version tag, then all we can be sure is that a fix commit exists in (current version, next version].
We can always add this logic later if we have more stats that support it.
go/cmd/gitter/repository.go
Outdated
| } | ||
| } | ||
|
|
||
| keys := slices.Collect(maps.Keys(unique)) |
There was a problem hiding this comment.
Does this need to do some sorting for tests to pass? Or is it fine.
There was a problem hiding this comment.
I used cmpopts.SortSlices() in the test file to make sure tests pass deterministically.
michaelkedar
left a comment
There was a problem hiding this comment.
Looking good - some initial comments
| semaphore chan struct{} // Request concurrency control | ||
| ) | ||
|
|
||
| var validURLRegex = regexp.MustCompile(`^(https?|git|ssh)://`) |
There was a problem hiding this comment.
(I know this isn't new, but) do we actually support ssh? I don't think the gitter on GKE has any credentials to use with it
| // Check if url starts with protocols: http(s)://, git://, ssh://, (s)ftp:// | ||
| if match, _ := regexp.MatchString("^(https?|git|ssh)://", url); !match { | ||
| if !validURLRegex.MatchString(url) { |
There was a problem hiding this comment.
nit: the comment says it supports ftp, but the regex does not (+ same on the comments in other endpoints)
go/cmd/gitter/gitter.go
Outdated
| // I can't change singleflight's interface | ||
| repoAny, err, _ := gLoad.Do(repoPath, func() (any, error) { |
There was a problem hiding this comment.
not in this PR, but if we're using singleflight heavily, and the any typing is clumsy, I reckon we could make a small wrapper with generics
type Group[T any] struct {
delegate singleflight.Group
}
func (g *Group[T]) Do(key string, fn func() (T, error)) (T, error, bool) { // or whatever's preferable
// call delegate.Do with wrapped functions and do the casting here
}
go/cmd/gitter/repository.go
Outdated
| // Remove from affected list if it was reached via a previous non-fixed path. | ||
| delete(affectedFromIntro, unaffected) |
There was a problem hiding this comment.
Do you not have to check if any of the commits here were introduced commits?
e.g. if it's like
I1 -> F1 -> I2 -> F2
If you do I2 first, then I1 would all the commits after F1 become unaffected?
There was a problem hiding this comment.
I think that's fine, since affectedFromIntro is per loop right? Otherwise we'll have problems from fixes and introduced being in parallel as well.
go/cmd/gitter/repository.go
Outdated
| } | ||
|
|
||
| // Affected returns a list of commits that are affected by the given introduced, fixed and last_affected events | ||
| func (r *Repository) Affected(ctx context.Context, introStrs, fixedStrs, laStrs []string, cherrypickIntro, cherrypickFixed bool) []*Commit { |
There was a problem hiding this comment.
It might be useful to move some of these structs/functions to a go/internal/gitter, so that other binaries may reuse the logic if they want to bypass the gitter service (e.g. for a datafix-type thing).
go/cmd/gitter/repository.go
Outdated
| } | ||
|
|
||
| // Between walks and returns the commits that are strictly between introduced (inclusive) and limit (exclusive) | ||
| func (r *Repository) Limit(ctx context.Context, introStrs, limitStrs []string) []*Commit { |
There was a problem hiding this comment.
do we want cherrypick detection here?
| } | ||
|
|
||
| // Testing cases with introduced and fixed only. | ||
| func TestAffected_Introduced_Fixed(t *testing.T) { |
There was a problem hiding this comment.
Can we add some tests for the edge cases when introduced, fixed and/or last_affected are equal to each other.
(and I suspect the logic may need to be adjusted to account for these)
There was a problem hiding this comment.
Also tests for introduced = 0
| message AffectedCommit { | ||
| bytes hash = 1; | ||
| repeated string refs = 2; | ||
| } | ||
|
|
||
| message AffectedCommitsResponse { | ||
| repeated AffectedCommit commits = 1; | ||
| } |
There was a problem hiding this comment.
we probably want to move refs to the top level response (as a mapping of ref -> hash)
| return hashes | ||
| } | ||
|
|
||
| // Affected returns a list of commits that are affected by the given introduced, fixed and last_affected events |
There was a problem hiding this comment.
Is it worth restating our definition of "affected" here in the code, where we're more likely to see it (and so help the reader understand why the code does what it does)?
Another point that might make sense to recap is just before the loop over the "introduced" commits.
another-rex
left a comment
There was a problem hiding this comment.
Oh btw for the bfs and the queue, did you know about: https://pkg.go.dev/container/list
Removed isIntroduced arg from parseHashes since it's unnecessary.
Add graph related logic and find affected commits.
Definition of affected commits:
Key changes
New endpoint:
POST /affected-commits:Cherry-pick Detection (optional):
Expand introduced and fixed commits to include their cherry-picked equivalents (determined with PatchID)
There's a follow-up PR (WIP) to refactor gitter (the repetitive fetches and semaphore logic, etc) and add LRU cache for repos.