Add CRC32C checksum based validation in GCS Avro to Cloud Spanner pipeline #3509

Merged
bharadwaj-aditya merged 1 commit into GoogleCloudPlatform:main from adityatulasi-google:main
Apr 15, 2026

Conversation

@adityatulasi-google
Contributor

No description provided.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds CRC32C checksum validation to the GCS Avro to Cloud Spanner pipeline. CRC32C serves as an alternative validation mechanism when the manifest does not provide an MD5 checksum, so that corrupted files can still be detected during import.

Highlights

  • CRC32C Checksum Support: Introduced new utility methods in FileChecksum.java to calculate CRC32C checksums for both local and Google Cloud Storage (GCS) files.
  • Enhanced File Validation Logic: Modified ImportTransform.java to prioritize MD5 checksum validation, but now gracefully fall back to CRC32C validation if MD5 is not provided in the manifest. An error is thrown if neither checksum type is found.
  • Protobuf Schema Update: Updated the export.proto schema to include a crc32c field within the TableManifest.File message, allowing manifests to specify CRC32C checksums.
  • New Test Cases: Added comprehensive test cases in ImportTransformTest.java to verify the correct functionality of CRC32C checksum validation and to ensure proper error handling when no checksum is present.
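
As a rough illustration of the first highlight, a CRC32C checksum can be computed with the JDK's `java.util.zip.CRC32C` (Java 9+). This is a minimal sketch, not the code added to FileChecksum.java: the class and method names (`Crc32cSketch`, `crc32cBase64`, `localFileCrc32c`) are hypothetical, and the Base64-of-big-endian-bytes encoding is an assumption, chosen because that is how GCS object metadata represents CRC32C. The manifest format used by this PR may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.zip.CRC32C;

/** Hypothetical helper for illustration; not the FileChecksum.java code from this PR. */
final class Crc32cSketch {
  private Crc32cSketch() {}

  /** Encodes a finished CRC32C as Base64 of its 4 big-endian bytes (assumed GCS-style encoding). */
  private static String encode(CRC32C crc) {
    // ByteBuffer is big-endian by default; the checksum fits in the low 32 bits.
    byte[] bigEndian = ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
    return Base64.getEncoder().encodeToString(bigEndian);
  }

  /** Computes the CRC32C of an in-memory byte array. */
  static String crc32cBase64(byte[] data) {
    CRC32C crc = new CRC32C();
    crc.update(data, 0, data.length);
    return encode(crc);
  }

  /** Streams a local file through CRC32C so large Avro files are never fully loaded into memory. */
  static String localFileCrc32c(Path path) throws IOException {
    CRC32C crc = new CRC32C();
    byte[] buffer = new byte[8192];
    try (InputStream in = Files.newInputStream(path)) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        crc.update(buffer, 0, read);
      }
    }
    return encode(crc);
  }
}
```

For the standard CRC-32C check input `"123456789"` (check value `0xE3069283`), `crc32cBase64` returns `"4waSgw=="`.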
Changelog
  • v1/src/main/java/com/google/cloud/teleport/spanner/FileChecksum.java
    • Added getLocalFileChecksumCrc32c method to compute CRC32C for local files.
    • Added getGcsFileChecksumsCrc32c method to compute CRC32C for GCS files.
  • v1/src/main/java/com/google/cloud/teleport/spanner/ImportTransform.java
    • Modified validateGcsFiles to check for CRC32C if MD5 is not present in the manifest.
    • Modified validateLocalFiles to check for CRC32C if MD5 is not present in the manifest.
    • Added error handling to throw a RuntimeException if neither MD5 nor CRC32C checksum is found for a file.
  • v1/src/main/proto/export.proto
    • Added crc32c field to the TableManifest.File message.
  • v1/src/test/java/com/google/cloud/teleport/spanner/ImportTransformTest.java
    • Added validateInputFilesWithCrc32c test case to verify CRC32C validation.
    • Added validateInputFilesNoChecksum test case to ensure an error is thrown when no checksum is provided.
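
The fallback order described in the changelog (MD5 first, then CRC32C, otherwise an error) can be sketched as follows. This is an illustrative outline only, not the actual `validateGcsFiles` or `validateLocalFiles` code: the class `ManifestChecksumSketch`, its methods, and the plain `String` arguments are hypothetical stand-ins for the manifest entry. It assumes that an empty string means "not provided", since generated protobuf string getters return an empty string for an unset field.

```java
/** Hypothetical sketch of the MD5-first, CRC32C-fallback choice; not the PR's ImportTransform code. */
final class ManifestChecksumSketch {
  enum Kind { MD5, CRC32C }

  private ManifestChecksumSketch() {}

  /** Picks which checksum to validate for one manifest entry; empty or null means "not provided". */
  static Kind chooseChecksum(String fileName, String md5, String crc32c) {
    if (md5 != null && !md5.isEmpty()) {
      return Kind.MD5;
    }
    if (crc32c != null && !crc32c.isEmpty()) {
      return Kind.CRC32C;
    }
    throw new RuntimeException(
        "No MD5 or CRC32C checksum found in manifest for file: " + fileName);
  }

  /** Compares the manifest checksum of the chosen kind with the freshly computed one. */
  static void verify(
      String fileName, String md5, String crc32c, String actualMd5, String actualCrc32c) {
    Kind kind = chooseChecksum(fileName, md5, crc32c);
    String expected = (kind == Kind.MD5) ? md5 : crc32c;
    String actual = (kind == Kind.MD5) ? actualMd5 : actualCrc32c;
    if (!expected.equals(actual)) {
      throw new RuntimeException(
          kind + " mismatch for file " + fileName + ": expected " + expected + ", got " + actual);
    }
  }
}
```

When both checksums are present, this sketch validates only MD5, matching the "prioritize MD5" wording in the summary.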

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@codecov

codecov Bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 35.06494% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.45%. Comparing base (e42572e) to head (7579f87).
⚠️ Report is 28 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...google/cloud/teleport/spanner/ExportTransform.java | 35.71% | 16 Missing and 2 partials ⚠️ |
| ...google/cloud/teleport/spanner/ImportTransform.java | 41.37% | 17 Missing ⚠️ |
| ...om/google/cloud/teleport/spanner/FileChecksum.java | 13.33% | 13 Missing ⚠️ |
| .../google/cloud/teleport/spanner/ExportPipeline.java | 60.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3509      +/-   ##
============================================
- Coverage     52.49%   52.45%   -0.05%     
- Complexity     5790     5814      +24     
============================================
  Files          1057     1062       +5     
  Lines         63968    64198     +230     
  Branches       7052     7101      +49     
============================================
+ Hits          33582    33673      +91     
- Misses        28110    28240     +130     
- Partials       2276     2285       +9     

| Components | Coverage Δ |
|---|---|
| spanner-templates | 72.03% <35.06%> (-0.31%) ⬇️ |
| spanner-import-export | 68.51% <35.06%> (-0.35%) ⬇️ |
| spanner-live-forward-migration | 80.87% <ø> (ø) |
| spanner-live-reverse-replication | 77.54% <ø> (-0.70%) ⬇️ |
| spanner-bulk-migration | 89.32% <ø> (ø) |
| gcs-spanner-dv | 85.75% <ø> (ø) |

| Files with missing lines | Coverage Δ |
|---|---|
| .../google/cloud/teleport/spanner/ExportPipeline.java | 9.37% <60.00%> (+9.37%) ⬆️ |
| ...om/google/cloud/teleport/spanner/FileChecksum.java | 12.90% <13.33%> (+0.40%) ⬆️ |
| ...google/cloud/teleport/spanner/ImportTransform.java | 22.99% <41.37%> (+0.74%) ⬆️ |
| ...google/cloud/teleport/spanner/ExportTransform.java | 16.44% <35.71%> (+0.70%) ⬆️ |

... and 16 files with indirect coverage changes

Contributor

@bharadwaj-aditya bharadwaj-aditya left a comment

There should be changes to the export pipeline to provide this as well.

Comment thread v1/src/main/java/com/google/cloud/teleport/spanner/ImportTransform.java Outdated
@adityatulasi-google
Contributor Author

There should be changes to the export pipeline to provide this as well.

It would not make sense to export CRC32C checksum for all exports. Should I introduce a parameter like below and use it?

    @TemplateParameter.Boolean(
        order = 14,
        groupName = "Source",
        optional = true,
        description = "Use CRC32C checksum in manifests",
        helpText = "Use CRC32C checksum instead of MD5 checksum for Avro files in the created manifests.")
    @Default.Boolean(false)
    ValueProvider<Boolean> useCrc32cChecksum();

Contributor Author

@adityatulasi-google adityatulasi-google left a comment

Responded to open comments. Please take another look.

@anowardear062-svg

Support and proses earnings

Contributor Author

@adityatulasi-google adityatulasi-google left a comment

Probably last iteration before making the changes.

Contributor

@bharadwaj-aditya bharadwaj-aditya left a comment

Please add the comment as suggested around the reasoning for the structure of the code.

LGTM overall

@bharadwaj-aditya
Contributor

There should be changes to the export pipeline to provide this as well.

It would not make sense to export CRC32C checksum for all exports. Should I introduce a parameter like below and use it?

    @TemplateParameter.Boolean(
        order = 14,
        groupName = "Source",
        optional = true,
        description = "Use CRC32C checksum in manifests",
        helpText = "Use CRC32C checksum instead of MD5 checksum for Avro files in the created manifests.")
    @Default.Boolean(false)
    ValueProvider<Boolean> useCrc32cChecksum();

Please make this an ENUM input with MD5 as the default.

Contributor Author

@adityatulasi-google adityatulasi-google left a comment

Added support for CRC32C in Export pipeline as well.

Comment thread v1/src/main/java/com/google/cloud/teleport/spanner/ImportTransform.java Outdated
Contributor Author

@adityatulasi-google adityatulasi-google left a comment

Cannot add default value to Enums in v1 templates.
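
One way to get an MD5 default without a declared default on the enum parameter is to resolve it in code: treat an unset value as MD5 and reject anything unrecognized. The sketch below is a hypothetical illustration of that idea, not the code merged in this PR. The enum name `ChecksumTypeSketch` and the method `fromParameter` are assumptions, and the raw parameter is modeled as a plain `String`.

```java
import java.util.Locale;

/** Hypothetical checksum-type option; the names here are illustrative, not from the PR. */
enum ChecksumTypeSketch {
  MD5,
  CRC32C;

  /**
   * Resolves the raw parameter value.
   * Unset (null or blank) falls back to MD5; unknown values fail fast instead of being ignored.
   */
  static ChecksumTypeSketch fromParameter(String raw) {
    if (raw == null || raw.trim().isEmpty()) {
      return MD5;
    }
    try {
      return valueOf(raw.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException(
          "Unsupported checksum type '" + raw + "'; expected MD5 or CRC32C", e);
    }
  }
}
```

With this approach, existing jobs that never set the parameter keep producing MD5 manifests, while a typo in the value stops the job at startup.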

Contributor

@bharadwaj-aditya bharadwaj-aditya left a comment

It seems the job does not work with incorrect parameters, so marking LGTM.

Contributor Author

@adityatulasi-google adityatulasi-google left a comment

Fixed spotless errors. PTAL.

Contributor

@bharadwaj-aditya bharadwaj-aditya left a comment

LGTM

@bharadwaj-aditya bharadwaj-aditya merged commit 0c63c66 into GoogleCloudPlatform:main Apr 15, 2026
15 checks passed

Labels

addition (New feature or request), size/L

3 participants