
Conversation

Contributor

@coderfender coderfender commented Jan 31, 2026

Which issue does this PR close?

Closes: #3209

Three kinds of tests have been identified as incompatible with the native writer:

  1. 12 tests in the SQL suite which failed or exposed corner cases with the Comet writer (Insert Overwrite / CTAS statements)
  2. SQL tests for the charvarchar.sql files
  3. Parquet metadata encoding tests (Spark writes metadata such as the Spark version, while Comet doesn't)

Rationale for this change

Now that we have a native Comet writer, this PR enables the Spark SQL tests to run with it, while ignoring the ones that fail.

What changes are included in this PR?

Diff file changes to enable the Comet native writer in the Spark tests.

How are these changes tested?

@coderfender force-pushed the enable_spark_tests_parquet_writer branch from b5cea4f to 951b394 on January 31, 2026 01:35

codecov-commenter commented Jan 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.99%. Comparing base (f09f8af) to head (6c6a142).
⚠️ Report is 918 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3351      +/-   ##
============================================
+ Coverage     56.12%   59.99%   +3.86%     
- Complexity      976     1464     +488     
============================================
  Files           119      175      +56     
  Lines         11743    16165    +4422     
  Branches       2251     2681     +430     
============================================
+ Hits           6591     9698    +3107     
- Misses         4012     5114    +1102     
- Partials       1140     1353     +213     


Contributor Author

coderfender commented Jan 31, 2026

@comphead, we have a bunch of failures, largely due to hitting edge cases with the Comet writer (CTAS and Insert Overwrite, which are likely bugs). I am thinking of two possible options here:

  1. Keep the changes minimal and ignore the failing I/O tests for now
  2. Create a separate Comet writer suite which ignores all the tests relevant to the Comet native writer

@coderfender
Contributor Author

@comphead please kick off CI whenever you get a chance.

@coderfender
Contributor Author

My diff files for Spark 3.4 and 3.5 missed disabling the Comet writer for CTAS statements (while it was set for Spark 4), causing the test failures above. However, there are now only 2 failed tests, down from 13.

@coderfender changed the title from "feat : enable_spark_tests_comet_native_writer" to "feat: enable_spark_tests_comet_native_writer" on Feb 1, 2026
@coderfender
Contributor Author

Seems like there were issues applying the diff file for Spark 4.

@coderfender
Contributor Author

@comphead all the tests passed. Please take a look whenever you get a chance.

@coderfender
Contributor Author

All tests passed, and the following tests are ignored:

  Common across all Spark versions (3.4, 3.5, 4.0):

  1. empty file should be skipped while write to file
  2. INSERT INTO TABLE - complex type but different names
  3. Insert overwrite table command should output correct schema: basic
  4. parquet timestamp conversion
  5. SPARK-29174 Support LOCAL in INSERT OVERWRITE DIRECTORY to data source
  6. SPARK-33901: ctas should should not change table's schema
  7. SPARK-37160: CREATE TABLE AS SELECT with CHAR_AS_VARCHAR
  8. SPARK-38336 INSERT INTO statements with tables with default columns: positive tests
  9. SPARK-38811 INSERT INTO on columns added with ALTER TABLE ADD COLUMNS: Positive tests
  10. SPARK-43071: INSERT INTO from queries whose final operators are not projections
  11. write path implements onTaskCommit API correctly
  12. Write Spark version into Parquet metadata

Additional tests which had to be ignored for Spark 4:

  1. ctas with union
  2. SPARK-48817: test multi inserts

  Total: 12 tests (Spark 3.4/3.5) | 14 tests (Spark 4.0)

@andygrove changed the title from "feat: enable_spark_tests_comet_native_writer" to "feat: Enable native Parquet writer in Spark SQL tests" on Feb 4, 2026

- test("SPARK-33901: ctas should should not change table's schema") {
+ test("SPARK-33901: ctas should should not change table's schema",
+ IgnoreComet("comet native writer does not support empty dir / table creation yet")) {
Member


Could you link to the specific GitHub issues in the IgnoreComet annotations?

Contributor Author


Thank you. I added the annotations and pushed a commit. However, per @comphead's review, I made this PR just copy the failing tests into CometParquetWriterSuite, and created GitHub issues for each of the failing tests in Spark 3.x and Spark 4.x.

The plan is to create a CI pipeline (one which doesn't interfere with the main Comet workflows) that runs a separate test suite with the Comet writer enabled and a listener attached, to make sure we are running the tests through the native writer.

@coderfender
Contributor Author

Fixing merge conflicts for the diff files :)

@coderfender force-pushed the enable_spark_tests_parquet_writer branch from 6c6a142 to b60ea7e on February 6, 2026 07:50
@coderfender
Contributor Author

Created GitHub issues for all the failing tests and added links in the IgnoreComet annotations.
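For reference, the resulting tagging pattern in the diff files looks roughly like the snippet below; the issue number NNNN is a placeholder, since each ignored test links to its own tracking issue:

```scala
// Sketch of the annotation pattern (issue number NNNN is a placeholder):
// the IgnoreComet tag carries a link to the tracking issue, so the reason
// for skipping is discoverable from the test itself.
test("SPARK-33901: ctas should should not change table's schema",
  IgnoreComet("https://github.com/apache/datafusion-comet/issues/NNNN: " +
    "Comet native writer does not support empty dir / table creation yet")) {
  // original test body unchanged
}
```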

@coderfender force-pushed the enable_spark_tests_parquet_writer branch from 72580df to 5064d58 on February 7, 2026 07:45
@coderfender force-pushed the enable_spark_tests_parquet_writer branch from fec7c20 to 370d98d on February 7, 2026 19:15
@mbutrovich
Contributor

Rather than ignoring tests in diffs, are we able to identify these scenarios at planning time and write fallbacks?

@coderfender
Contributor Author

@mbutrovich, thank you for the comment. These tests were essentially exposing bugs in our writer which are not caught by our current fallbacks:

  1. Empty file/dir creation
    empty file should be skipped while write to file
    SPARK-33901: ctas should not change table's schema
    SPARK-37160: CREATE TABLE AS SELECT with CHAR_AS_VARCHAR

  2. Insert overwrite bugs (5 tests)
    Insert overwrite table command should output correct schema: basic
    SPARK-29174 Support LOCAL in INSERT OVERWRITE DIRECTORY
    SPARK-38336 INSERT INTO with default columns
    SPARK-38811 INSERT INTO on ALTER TABLE ADD COLUMNS
    SPARK-43071: INSERT INTO from non-projection queries

  3. Other
    parquet timestamp conversion (timestamp96)
    INSERT INTO TABLE - complex type but different names
    Write Spark version into Parquet metadata
    write path implements onTaskCommit API correctly

I created separate issues for each failing test and annotated them in the diffs. However, per my discussion with @comphead, this PR is just the first step: copying those tests into the Comet test suite lays the foundation for a future Comet-writer-only test suite in CI.

Contributor

@comphead comphead left a comment


Thanks @coderfender, great work identifying issues for the writer.
One question: how is it planned to check that the native writer has been called?
You might consider the option to call writeWithCometNativeWriteExec, or look into how SparkListener is used inside.

@coderfender
Contributor Author

Great suggestion @comphead. I initially added a custom listener to check that the native writer is called. However, I figured it might be cleaner to limit the scope of this PR to just tagging the failing tests, and to follow up with more work that builds on this. Do you think it is a good idea to add that verification in this PR, or would you prefer the listener changes in a different one?
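For context, a rough sketch of the kind of listener I mean is below. It matches the write operator by class-name substring because the exact class name ("CometNativeWrite" here) is an assumption about the operator, not a confirmed Comet API:

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hedged sketch: records whether any completed query's executed plan
// contained the Comet native write operator. The class-name substring
// match ("CometNativeWrite") is an assumption, not a confirmed API.
class NativeWriterCheckListener extends QueryExecutionListener {
  @volatile var sawNativeWrite = false

  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    val usedNative = qe.executedPlan.collect {
      case p if p.getClass.getSimpleName.contains("CometNativeWrite") => p
    }.nonEmpty
    if (usedNative) sawNativeWrite = true
  }

  override def onFailure(funcName: String, qe: QueryExecution, ex: Exception): Unit = ()
}

// Usage in a test suite (sketch): register the listener, run the write,
// then assert that the native writer actually executed.
//   val listener = new NativeWriterCheckListener
//   spark.listenerManager.register(listener)
//   df.write.parquet(path)
//   assert(listener.sawNativeWrite, "write did not go through the Comet native writer")
```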

@coderfender
Contributor Author

Never mind, I think I got a better idea after this PR: #3480



Development

Successfully merging this pull request may close these issues.

Run Spark tests with Comet native writer

5 participants