Skip to content

feat: defensive identity overrides for MavenComponent and GitComponent#1822

Open
zhenghao104 wants to merge 3 commits into
mainfrom
users/zhentan/sbom-identity-overrides-maven-git
Open

feat: defensive identity overrides for MavenComponent and GitComponent#1822
zhenghao104 wants to merge 3 commits into
mainfrom
users/zhentan/sbom-identity-overrides-maven-git

Conversation

@zhenghao104
Copy link
Copy Markdown
Contributor

Summary

Two small defensive identity overrides on TypedComponent subclasses, plus a PackageUrl override on GitComponent so github.com repositories get a canonical pkg:github/... PURL.

Component Change
MavenComponent Override GetExtendedIdProperties() to return empty (suppress DownloadUrl/SourceUrl from Id).
GitComponent Same GetExtendedIdProperties() override + new PackageUrl override returning pkg:github/{owner}/{repo}@{commit} for github.com URLs and null for other hosts.

Why

Today no detector populates DownloadUrl/SourceUrl on MavenComponent or GitComponent, so Id == BaseId. Upcoming SBOM-enrichment work in microsoft/component-detection-internal will start emitting deterministic download URLs through the Maven/Git converters. Without these overrides, Id would change shape to "BaseId [DownloadUrl:... SourceUrl:...]", creating bare-vs-rich identity-merge conflicts between:

  • the two MvnCli internal paths (mvn-CLI output vs. static pom.xml parser), and
  • the multiple Git submitters (CocoaPods / Poetry / Pip / Swift / Ruby / cgmanifest).

GitComponent.PackageUrl was the only typed component returning null. Adding the github PURL aligns it with every other type and lets downstream tooling (SBOM, AzDO Governance) dedupe github coords by PURL instead of by raw repository URL.

Behavior changes

  • MavenId already equaled BaseId for every existing detector; the override locks the contract going forward. No observable change today.
  • GitId likewise unchanged today. PackageUrl goes from null"pkg:github/owner/repo@commit" for github.com repositories only. Other hosts (GitLab / Bitbucket / Azure DevOps / GitHub Enterprise) keep returning null — consumers should continue to fall back to RepositoryUrl for those.
    • Searched both this repo and microsoft/component-detection-internal for callers that assert GitComponent.PackageUrl == null — none found. VerificationTests uses PackageUrl in a dedup key but both sides of the intersection run the same binary, so keys still match.

Edge cases covered

URL shape Result
https://github.com/google/guava pkg:github/google/guava@<commit>
https://github.com/google/guava.git .git suffix stripped
https://github.com/google/guava/ trailing slash normalised
https://GitHub.com/google/guava host match is case-insensitive
https://github.com:443/google/guava port stripped via Uri.Host
https://github.com/google/guava?utm=x#readme query/fragment ignored via Uri.AbsolutePath
https://gitlab.com/foo/bar, https://bitbucket.org/foo/bar, https://dev.azure.com/..., https://github.contoso.com/foo/bar null
https://github.com/onlyowner, https://github.com/owner/repo/tree/main, https://github.com/ null (not canonical owner/repo paths)
SSH-style git@github.com:owner/repo.git not a parseable absolute Uri; rejected upstream
RepositoryUrl == null or CommitHash null/empty (deserialization path) null

Tests

Microsoft.ComponentDetection.Contracts.Tests+10 tests, 84/84 passing:

  • PurlGenerationTests — explicit MavenComponentShouldGenerateMavenPurl plus 7 GitHub PURL cases: canonical, .git suffix, trailing slash, case-insensitive host, non-github hosts (gitlab + bitbucket + ADO + GHE in one test), malformed paths (owner-only + too-deep + root-only in one test), missing commit hash.
  • TypedComponentSerializationTestsMavenComponent_Id_ExcludesDownloadUrlAndSourceUrl_WhenSet and GitComponent_Id_ExcludesDownloadUrlAndSourceUrl_WhenSet prove the overrides exclude both URLs even when set.

Broader regression check (all run locally on this branch):

Project Result
Contracts.Tests 84 / 84 passed (incl. 10 new)
Detectors.Tests 910 / 910 passedGradleComponentDetectorTests and others that inspect Id strings like "... - Maven" still match because only GetExtendedIdProperties() changed, not ComputeBaseId().
Orchestrator.Tests 151 / 151 passed
Common.Tests 213 / 221 passed (1 pre-existing flake: DockerService_CanPingDocker needs a Docker daemon; the other 7 skips are platform-conditioned)

Out of scope

  • The CD-Internal converter changes (consume the new GitComponent.PackageUrl, emit deterministic DownloadURL for both Maven and Git) ship in a follow-up PR in microsoft/component-detection-internal once a new CD-public NuGet drops.
  • A pre-existing bug at GitComponentConverter.cs:28 in CD-Internal (literal Replace("https://github.com/", "") that produces garbage owner/repo for non-github URLs) — same follow-up PR.

MavenComponent
- Override GetExtendedIdProperties() to suppress DownloadUrl/SourceUrl
  from Id. GAV is the canonical Maven identity; the Maven Central
  download URL is deterministic from GAV and SourceUrl is surfaced
  server-side from the POM - neither should affect identity.

GitComponent
- Add PackageUrl override returning pkg:github/{owner}/{repo}@{commit}
  for repositories hosted on github.com (case-insensitive host match,
  .git suffix stripped, trailing slash normalised, path must resolve
  cleanly to owner/repo). Returns null for any other host (gitlab,
  bitbucket, ADO, GitHub Enterprise) and for malformed paths;
  consumers should fall back to RepositoryUrl in that case.
- Override GetExtendedIdProperties() defensively for the same reason
  as MavenComponent.

Tests
- PurlGenerationTests: explicit Maven coverage plus 7 GitHub PURL
  cases (canonical, .git suffix, trailing slash, case-insensitive
  host, non-github hosts, malformed paths, missing commit hash).
- TypedComponentSerializationTests: Id-stability tests proving both
  overrides exclude DownloadUrl/SourceUrl from Id even when set.

No production behaviour change today (no detector currently sets
DownloadUrl/SourceUrl on MavenComponent/GitComponent), but locks
identity stability ahead of upcoming SBOM enrichment work that will
populate DownloadUrl in the CD-Internal converter.
Copilot AI review requested due to automatic review settings June 6, 2026 00:38
@zhenghao104 zhenghao104 requested a review from a team as a code owner June 6, 2026 00:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens identity stability for MavenComponent and GitComponent by ensuring optional provenance metadata (DownloadUrl / SourceUrl) never changes their TypedComponent.Id, and it adds a canonical pkg:github/... Package URL for GitComponent when the repository is hosted on github.com. This aligns Git components with other typed components that already emit PURLs, improving downstream deduplication and SBOM interoperability.

Changes:

  • Override GetExtendedIdProperties() in MavenComponent and GitComponent to suppress DownloadUrl/SourceUrl from contributing to Id.
  • Implement GitComponent.PackageUrl to emit pkg:github/{owner}/{repo}@{commit} for canonical github.com/{owner}/{repo} URLs (including .git suffix and trailing slash normalization), returning null otherwise.
  • Add/extend contract tests covering both the identity-stability behavior and GitHub PURL generation edge cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/Microsoft.ComponentDetection.Contracts/TypedComponent/MavenComponent.cs Keeps Maven Id stable by excluding provenance URLs from extended identity properties.
src/Microsoft.ComponentDetection.Contracts/TypedComponent/GitComponent.cs Keeps Git Id stable and adds canonical GitHub PURL generation for github.com repositories.
test/Microsoft.ComponentDetection.Contracts.Tests/TypedComponentSerializationTests.cs Adds tests asserting Id does not change when DownloadUrl/SourceUrl are set for Maven/Git components.
test/Microsoft.ComponentDetection.Contracts.Tests/PurlGenerationTests.cs Adds tests validating Maven PURL and multiple GitHub PURL normalization/guardrail cases for GitComponent.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 8, 2026

👋 Hi! It looks like you modified some files in the Detectors folder.
You may need to bump the detector versions if any of the following scenarios apply:

  • The detector detects more or fewer components than before
  • The detector generates different parent/child graph relationships than before
  • The detector generates different devDependencies values than before

If none of the above scenarios apply, feel free to ignore this comment 🙂

@RushabhBhansali
Copy link
Copy Markdown
Contributor

does verification tests currently test for PackageUrl ? We should probably update the pom file here to ensure that test data has some packageUrls.

Update verification Maven resource test/Microsoft.ComponentDetection.VerificationTests/resources/maven/lib/pom.xml to make PackageUrl-producing coordinates explicit:
- add module-level <groupId>com.microsoft</groupId>
- add literal dependency org.apache.commons:commons-lang3:3.12.0

This keeps existing semantics intact while ensuring at least one stable, non-property-indirected Maven coordinate is present in the fixture for verification comparisons that key on Component.PackageUrl.
Copilot AI review requested due to automatic review settings June 8, 2026 21:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment on lines +51 to +55
if (string.IsNullOrEmpty(this.CommitHash)
|| !TryGetGithubOwnerAndRepo(this.RepositoryUrl, out var owner, out var repo))
{
return null;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants