feat: defensive identity overrides for MavenComponent and GitComponent#1822
feat: defensive identity overrides for MavenComponent and GitComponent#1822zhenghao104 wants to merge 3 commits into
Conversation
MavenComponent
- Override GetExtendedIdProperties() to suppress DownloadUrl/SourceUrl
from Id. GAV is the canonical Maven identity; the Maven Central
download URL is deterministic from GAV and SourceUrl is surfaced
server-side from the POM - neither should affect identity.
GitComponent
- Add PackageUrl override returning pkg:github/{owner}/{repo}@{commit}
for repositories hosted on github.com (case-insensitive host match,
.git suffix stripped, trailing slash normalised, path must resolve
cleanly to owner/repo). Returns null for any other host (gitlab,
bitbucket, ADO, GitHub Enterprise) and for malformed paths;
consumers should fall back to RepositoryUrl in that case.
- Override GetExtendedIdProperties() defensively for the same reason
as MavenComponent.
Tests
- PurlGenerationTests: explicit Maven coverage plus 7 GitHub PURL
cases (canonical, .git suffix, trailing slash, case-insensitive
host, non-github hosts, malformed paths, missing commit hash).
- TypedComponentSerializationTests: Id-stability tests proving both
overrides exclude DownloadUrl/SourceUrl from Id even when set.
No production behaviour change today (no detector currently sets
DownloadUrl/SourceUrl on MavenComponent/GitComponent), but locks
identity stability ahead of upcoming SBOM enrichment work that will
populate DownloadUrl in the CD-Internal converter.
There was a problem hiding this comment.
Pull request overview
This PR hardens identity stability for MavenComponent and GitComponent by ensuring optional provenance metadata (DownloadUrl / SourceUrl) never changes their TypedComponent.Id, and it adds a canonical pkg:github/... Package URL for GitComponent when the repository is hosted on github.com. This aligns Git components with other typed components that already emit PURLs, improving downstream deduplication and SBOM interoperability.
Changes:
- Override
GetExtendedIdProperties()inMavenComponentandGitComponentto suppressDownloadUrl/SourceUrlfrom contributing toId. - Implement
GitComponent.PackageUrlto emitpkg:github/{owner}/{repo}@{commit}for canonicalgithub.com/{owner}/{repo}URLs (including.gitsuffix and trailing slash normalization), returningnullotherwise. - Add/extend contract tests covering both the identity-stability behavior and GitHub PURL generation edge cases.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/Microsoft.ComponentDetection.Contracts/TypedComponent/MavenComponent.cs |
Keeps Maven Id stable by excluding provenance URLs from extended identity properties. |
src/Microsoft.ComponentDetection.Contracts/TypedComponent/GitComponent.cs |
Keeps Git Id stable and adds canonical GitHub PURL generation for github.com repositories. |
test/Microsoft.ComponentDetection.Contracts.Tests/TypedComponentSerializationTests.cs |
Adds tests asserting Id does not change when DownloadUrl/SourceUrl are set for Maven/Git components. |
test/Microsoft.ComponentDetection.Contracts.Tests/PurlGenerationTests.cs |
Adds tests validating Maven PURL and multiple GitHub PURL normalization/guardrail cases for GitComponent. |
|
👋 Hi! It looks like you modified some files in the
If none of the above scenarios apply, feel free to ignore this comment 🙂 |
|
does verification tests currently test for PackageUrl ? We should probably update the pom file here to ensure that test data has some packageUrls. |
Update verification Maven resource test/Microsoft.ComponentDetection.VerificationTests/resources/maven/lib/pom.xml to make PackageUrl-producing coordinates explicit: - add module-level <groupId>com.microsoft</groupId> - add literal dependency org.apache.commons:commons-lang3:3.12.0 This keeps existing semantics intact while ensuring at least one stable, non-property-indirected Maven coordinate is present in the fixture for verification comparisons that key on Component.PackageUrl.
| if (string.IsNullOrEmpty(this.CommitHash) | ||
| || !TryGetGithubOwnerAndRepo(this.RepositoryUrl, out var owner, out var repo)) | ||
| { | ||
| return null; | ||
| } |
Summary
Two small defensive identity overrides on
TypedComponentsubclasses, plus aPackageUrloverride onGitComponentso github.com repositories get a canonicalpkg:github/...PURL.MavenComponentGetExtendedIdProperties()to return empty (suppressDownloadUrl/SourceUrlfromId).GitComponentGetExtendedIdProperties()override + newPackageUrloverride returningpkg:github/{owner}/{repo}@{commit}for github.com URLs andnullfor other hosts.Why
Today no detector populates
DownloadUrl/SourceUrlonMavenComponentorGitComponent, soId == BaseId. Upcoming SBOM-enrichment work inmicrosoft/component-detection-internalwill start emitting deterministic download URLs through the Maven/Git converters. Without these overrides,Idwould change shape to"BaseId [DownloadUrl:... SourceUrl:...]", creating bare-vs-rich identity-merge conflicts between:MvnCliinternal paths (mvn-CLI output vs. static pom.xml parser), andGitComponent.PackageUrlwas the only typed component returningnull. Adding the github PURL aligns it with every other type and lets downstream tooling (SBOM, AzDO Governance) dedupe github coords by PURL instead of by raw repository URL.Behavior changes
Idalready equaledBaseIdfor every existing detector; the override locks the contract going forward. No observable change today.Idlikewise unchanged today.PackageUrlgoes fromnull→"pkg:github/owner/repo@commit"for github.com repositories only. Other hosts (GitLab / Bitbucket / Azure DevOps / GitHub Enterprise) keep returningnull— consumers should continue to fall back toRepositoryUrlfor those.microsoft/component-detection-internalfor callers that assertGitComponent.PackageUrl == null— none found.VerificationTestsusesPackageUrlin a dedup key but both sides of the intersection run the same binary, so keys still match.Edge cases covered
https://github.com/google/guavapkg:github/google/guava@<commit>https://github.com/google/guava.git.gitsuffix strippedhttps://github.com/google/guava/https://GitHub.com/google/guavahttps://github.com:443/google/guavaUri.Hosthttps://github.com/google/guava?utm=x#readmeUri.AbsolutePathhttps://gitlab.com/foo/bar,https://bitbucket.org/foo/bar,https://dev.azure.com/...,https://github.contoso.com/foo/barnullhttps://github.com/onlyowner,https://github.com/owner/repo/tree/main,https://github.com/null(not canonical owner/repo paths)git@github.com:owner/repo.gitUri; rejected upstreamRepositoryUrl == nullorCommitHashnull/empty (deserialization path)nullTests
Microsoft.ComponentDetection.Contracts.Tests— +10 tests, 84/84 passing:PurlGenerationTests— explicitMavenComponentShouldGenerateMavenPurlplus 7 GitHub PURL cases: canonical,.gitsuffix, trailing slash, case-insensitive host, non-github hosts (gitlab + bitbucket + ADO + GHE in one test), malformed paths (owner-only + too-deep + root-only in one test), missing commit hash.TypedComponentSerializationTests—MavenComponent_Id_ExcludesDownloadUrlAndSourceUrl_WhenSetandGitComponent_Id_ExcludesDownloadUrlAndSourceUrl_WhenSetprove the overrides exclude both URLs even when set.Broader regression check (all run locally on this branch):
Contracts.TestsDetectors.TestsGradleComponentDetectorTestsand others that inspectIdstrings like"... - Maven"still match because onlyGetExtendedIdProperties()changed, notComputeBaseId().Orchestrator.TestsCommon.TestsDockerService_CanPingDockerneeds a Docker daemon; the other 7 skips are platform-conditioned)Out of scope
GitComponent.PackageUrl, emit deterministicDownloadURLfor both Maven and Git) ship in a follow-up PR inmicrosoft/component-detection-internalonce a new CD-public NuGet drops.GitComponentConverter.cs:28in CD-Internal (literalReplace("https://github.com/", "")that produces garbageowner/repofor non-github URLs) — same follow-up PR.