perf(metadata): avoid recursive calls for partition listing using catalog by suryaprasanna · Pull Request #18265 · apache/hudi

suryaprasanna · 2026-03-01T21:41:22Z

Describe the issue this Pull Request addresses

When the metadata table is disabled or corrupted, partition listing operations can result in expensive recursive filesystem queries. This PR introduces a catalog-backed approach that fetches partition information directly from the Spark external catalog, avoiding the recursive calls and improving query performance.

Summary and Changelog

Users gain improved performance for partition listing operations when the metadata table is unavailable. Changes in this PR:

Added CatalogBackedTableMetadata class that fetches partitions from Spark's external catalog
Added FILE_INDEX_PARTITION_LISTING_VIA_CATALOG config to enable catalog-based partition listing
Modified SparkHoodieTableFileIndex to use catalog-backed metadata when metadata table is not available
Added PartitionPathFilterUtil for partition path filtering logic
Refactored BaseHoodieTableFileIndex.createMetadataTable() to be overridable
Added comprehensive unit tests in TestCatalogBackedTableMetadata

Impact

Performance: Reduced latency for partition listing when the metadata table is disabled, by avoiding recursive filesystem queries
API Change: Added new config option FILE_INDEX_PARTITION_LISTING_VIA_CATALOG (default: false)
Behavior: When enabled and the metadata table is unavailable, partitions are fetched from the catalog instead of the filesystem
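
The behavior described above can be sketched as a small decision function. This is only an illustration, not the code in this PR: the class, enum, and method names are hypothetical, the key string is the one quoted in the merged commit message of this PR, and the default of false follows the API Change line.

```java
import java.util.Map;

/** Hypothetical sketch of the listing-backend choice; not Hudi's actual code. */
final class PartitionListingChoice {
  enum Backend { METADATA_TABLE, CATALOG, FILESYSTEM }

  // Key string as it appears in the merged commit message of this PR.
  static final String LIST_FROM_CATALOG_KEY =
      "hoodie.datasource.read.file.index.list.partitions.from.catalog";
  // Default taken from the PR description (default: false).
  static final String LIST_FROM_CATALOG_DEFAULT = "false";

  static Backend choose(boolean metadataTableAvailable, Map<String, String> options) {
    if (metadataTableAvailable) {
      // The catalog path is only considered when the metadata table cannot be used.
      return Backend.METADATA_TABLE;
    }
    boolean viaCatalog = Boolean.parseBoolean(
        options.getOrDefault(LIST_FROM_CATALOG_KEY, LIST_FROM_CATALOG_DEFAULT));
    // With the flag off, the existing filesystem listing is kept.
    return viaCatalog ? Backend.CATALOG : Backend.FILESYSTEM;
  }
}
```

With an empty options map and no metadata table, the sketch returns FILESYSTEM, which matches the statement that the feature is disabled by default.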

Risk Level

Low - Feature is behind a config flag (disabled by default). Extensive unit tests verify catalog-based partition listing behavior. The existing filesystem-based approach is used when the config is disabled.

Documentation Update

Config documentation needs to be updated to include the new FILE_INDEX_PARTITION_LISTING_VIA_CATALOG option, describing when to enable catalog-based partition listing for performance optimization.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

nsivabalan · 2026-04-02T02:41:41Z

Pushed a commit to address minor feedback and fix tests.

hudi-bot · 2026-04-03T22:43:53Z

CI report:

4fe8df3 Azure: SUCCESS

codecov-commenter · 2026-04-03T22:47:55Z

Codecov Report

❌ Patch coverage is 69.62025% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.75%. Comparing base (3aef2ca) to head (4fe8df3).
⚠️ Report is 43 commits behind head on master.

Files with missing lines	Patch %	Lines
...che/hudi/metadata/CatalogBackedTableMetadata.scala	52.27%	12 Missing and 9 partials ⚠️
...la/org/apache/hudi/SparkHoodieTableFileIndex.scala	85.71%	0 Missing and 2 partials ⚠️
.../org/apache/hudi/util/PartitionPathFilterUtil.java	75.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18265      +/-   ##
============================================
+ Coverage     68.46%   68.75%   +0.29%     
- Complexity    27472    28080     +608     
============================================
  Files          2427     2448      +21     
  Lines        132655   134528    +1873     
  Branches      15994    16268     +274     
============================================
+ Hits          90819    92494    +1675     
+ Misses        34786    34734      -52     
- Partials       7050     7300     +250

Flag	Coverage Δ
common-and-other-modules	`44.49% <24.35%> (+0.05%)`	⬆️
hadoop-mr-java-client	`44.85% <45.45%> (-0.15%)`	⬇️
spark-client-hadoop-common	`48.49% <63.63%> (+0.28%)`	⬆️
spark-java-tests	`48.76% <30.37%> (-0.07%)`	⬇️
spark-scala-tests	`45.62% <67.08%> (+0.66%)`	⬆️
utilities	`38.34% <21.79%> (-0.29%)`	⬇️

Files with missing lines	Coverage Δ
...java/org/apache/hudi/BaseHoodieTableFileIndex.java	`83.41% <100.00%> (-0.08%)`	⬇️
...e/hudi/metadata/FileSystemBackedTableMetadata.java	`85.15% <100.00%> (+0.73%)`	⬆️
...pache/hudi/metadata/HoodieBackedTableMetadata.java	`82.56% <100.00%> (-0.23%)`	⬇️
.../org/apache/hudi/metadata/HoodieTableMetadata.java	`76.19% <100.00%> (-2.76%)`	⬇️
...main/scala/org/apache/hudi/DataSourceOptions.scala	`95.49% <100.00%> (+0.09%)`	⬆️
...c/main/scala/org/apache/hudi/HoodieFileIndex.scala	`83.08% <100.00%> (+0.32%)`	⬆️
.../org/apache/hudi/util/PartitionPathFilterUtil.java	`75.00% <75.00%> (ø)`
...la/org/apache/hudi/SparkHoodieTableFileIndex.scala	`72.48% <85.71%> (+0.21%)`	⬆️
...che/hudi/metadata/CatalogBackedTableMetadata.scala	`52.27% <52.27%> (ø)`

... and 162 files with indirect coverage changes

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Accidentally approved

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — One minor formatting issue with method signature indentation in CatalogBackedTableMetadata.scala; otherwise code is clean and readable.

yihua · 2026-04-05T04:44:45Z

+    catalogTable.partitionColumnNames.nonEmpty
+  }
+
+  private def shouldUseCatalogPartitions: Boolean = {


🤖 nit: consider moving the return type annotation to the same line as the method signature for idiomatic Scala style (e.g., override def getAllPartitionPaths: util.List[String] = or with parentheses if preferred).

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The catalog-backed partition listing approach is a smart optimization that avoids expensive recursive filesystem queries. However, there are a few issues to address before merging: the overridable createMetadataTable() call during base-class construction risks NPEs in subclasses, the catalog lookup needs graceful fallback when a table isn't registered, and the List<Object> partition predicates should use a properly typed interface.

yihua · 2026-04-05T04:52:53Z

+    return metaClient.getTableFormat().getMetadataFactory()
+        .create(engineContext, metaClient.getStorage(), metadataConfig, basePath.toString(), true);
  }



🤖 Making createMetadataTable protected (overridable) while it's called from doRefresh() at line 201 during the base class constructor is risky. If a subclass overrides this method and accesses subclass-specific fields (e.g., catalog reference, spark session), those fields won't be initialized yet when the base constructor runs. Could you consider using lazy initialization or a post-construction init() method instead?
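
To illustrate the concern, here is a minimal, self-contained sketch of the initialization-order problem and of a lazy alternative. All class and field names are hypothetical stand-ins, not Hudi classes; the point is only that an overridable method invoked from a base constructor runs before the subclass has assigned its fields.

```java
// Hypothetical stand-in for a base file index that calls an overridable
// factory method from its constructor.
abstract class BaseIndex {
  final String eagerMetadata;

  BaseIndex() {
    // Runs before any subclass constructor body has executed.
    this.eagerMetadata = createMetadata();
  }

  protected abstract String createMetadata();
}

final class CatalogIndex extends BaseIndex {
  private final String catalogName;

  CatalogIndex(String catalogName) {
    super();                        // createMetadata() runs here; catalogName is still null
    this.catalogName = catalogName;
  }

  @Override
  protected String createMetadata() {
    // Concatenation hides the problem; a method call on catalogName would throw an NPE.
    return "metadata-from-" + catalogName;
  }
}

// Lazy alternative: the factory is only invoked on first use, after construction.
abstract class LazyBaseIndex {
  private String metadata;

  protected abstract String createMetadata();

  final synchronized String metadata() {
    if (metadata == null) {
      metadata = createMetadata();
    }
    return metadata;
  }
}

final class LazyCatalogIndex extends LazyBaseIndex {
  private final String catalogName;

  LazyCatalogIndex(String catalogName) {
    this.catalogName = catalogName;
  }

  @Override
  protected String createMetadata() {
    return "metadata-from-" + catalogName;
  }
}
```

In the eager variant, `new CatalogIndex("hive").eagerMetadata` is built from a null `catalogName`; in the lazy variant, `new LazyCatalogIndex("hive").metadata()` sees the initialized field.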

yihua · 2026-04-05T04:52:53Z

                                                                   Types.RecordType partitionFields,
                                                                   Expression expression) throws IOException;

+  default List<String> getPartitionPathWithPathPrefixUsingFilterExpression(List<String> relativePathPrefixes,


🤖 Using List<Object> for partitionPredicateExpressions loses all type safety. Callers will need to cast blindly, and any mismatch will only surface as a runtime ClassCastException. Could this be a generic type parameter on the interface, or would a more specific type (even a simple wrapper) work here to avoid the raw Object list?
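
One possible shape for the generic-parameter idea, as a sketch only: the interface and class names below are hypothetical and do not exist in Hudi, and `Predicate<String>` stands in for an engine-specific expression type such as Spark's Expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical interface: E is the engine's predicate type, fixed by the implementation.
interface PartitionLister<E> {
  List<String> listPartitions(List<String> candidates, List<E> predicates);
}

// Toy engine whose "expression" type is Predicate<String>; no casts are needed.
final class PredicatePartitionLister implements PartitionLister<Predicate<String>> {
  @Override
  public List<String> listPartitions(List<String> candidates,
                                     List<Predicate<String>> predicates) {
    List<String> kept = new ArrayList<>();
    for (String path : candidates) {
      boolean matchesAll = true;
      for (Predicate<String> predicate : predicates) {
        if (!predicate.test(path)) {
          matchesAll = false;
          break;
        }
      }
      if (matchesAll) {
        kept.add(path);
      }
    }
    return kept;
  }
}
```

With this shape, passing a list of the wrong predicate type is rejected at compile time rather than surfacing as a ClassCastException at runtime.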

yihua · 2026-04-05T04:52:53Z


  private static final int DEFAULT_LISTING_PARALLELISM = 1500;

+  @Getter


🤖 Is getDatabaseName() guaranteed to be non-null here? For tables created without a database name in the table config, this would store null. If the catalog-backed path later uses this database name to look up partitions, that could fail. Have you verified the behavior for tables without an explicit database name?

yihua · 2026-04-05T04:52:53Z

+    }
+  private lazy val tableIdentifier = TableIdentifier(catalogTableName, Some(catalogDatabaseName))
+  private lazy val catalogTable = sparkSession.sessionState.catalog.getTableMetadata(tableIdentifier)
+


🤖 If the table isn't registered in the catalog (e.g., created via DataSource API with just a path), getTableMetadata will throw NoSuchTableException. Since catalogTable is accessed by isPartitionedTable and shouldUseCatalogPartitions in every partition listing method, the exception propagates up without any fallback to super (filesystem listing). Could you wrap the catalog access in a Try and fall back to super when the table isn't found?
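
A rough sketch of the suggested fallback, written in plain Java so it is self-contained. The exception class is a hypothetical stand-in for Spark's NoSuchTableException, and the two suppliers stand in for the catalog lookup and the existing super (filesystem) listing; none of these names come from the PR.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the exception thrown when a table is not registered.
final class TableNotInCatalogException extends RuntimeException {
  TableNotInCatalogException(String table) {
    super("Table not found in catalog: " + table);
  }
}

final class PartitionListingWithFallback {
  static List<String> list(Supplier<List<String>> catalogListing,
                           Supplier<List<String>> filesystemListing) {
    try {
      return catalogListing.get();
    } catch (TableNotInCatalogException e) {
      // Path-only tables are not registered in the catalog, so keep the existing listing.
      return filesystemListing.get();
    }
  }
}
```

Only the "table not found" case is caught here; other catalog failures would still propagate, which may or may not be the desired policy.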

yihua · 2026-04-05T04:52:53Z

+  private val sparkSession = engineContext.asInstanceOf[HoodieSparkEngineContext].getSqlContext.sparkSession
+  private val catalogTableName = tableConfig.getTableName
+  private lazy val catalogDatabaseName =
+    if (StringUtils.isNullOrEmpty(tableConfig.getDatabaseName)) {


🤖 When getDatabaseName returns null/empty, this falls back to getCurrentDatabase, which reflects the session's current database context — not necessarily where the table is registered. If a user runs USE some_other_db before querying, this would look up the wrong database and either throw NoSuchTableException or find a different table with the same name. Would it be safer to resolve the database from the catalog table's metadata (e.g., via the table's location path) or at least document this assumption?

yihua · 2026-04-05T04:52:53Z

+    if (!isPartitionedTable) {
+      util.Collections.emptyList()
+    } else if (shouldUseCatalogPartitions) {
+      val partitionPredicateExpressionSeq = partitionPredicateExpressions.asScala.map(_.asInstanceOf[Expression]).toSeq


🤖 The else branch falls back to the 3-param super call (without partitionPredicateExpressions), which means the Spark partition predicate expressions are silently dropped when the catalog isn't available. Is this intentional? In that case the filesystem listing only uses the Hudi pushedExpr for filtering, which should still be correct, but I wanted to confirm no filter information is lost here.

…alog (apache#18265)

When metadata table is disabled or corrupted, partition listing operations can result in expensive recursive filesystem queries. This PR introduces a catalog-backed approach to fetch partition information directly from the Spark external catalog, avoiding recursive calls and improving query performance.

Summary and Changelog

Users gain improved performance for partition listing operations when metadata table is unavailable. The change introduces:

Added CatalogBackedTableMetadata class that fetches partitions from Spark's external catalog
Added "hoodie.datasource.read.file.index.list.partitions.from.catalog" config to enable catalog-based partition listing
Modified SparkHoodieTableFileIndex to use catalog-backed metadata when metadata table is not available
Added PartitionPathFilterUtil for partition path filtering logic
Refactored BaseHoodieTableFileIndex.createMetadataTable() to be overridable
Added comprehensive unit tests in TestCatalogBackedTableMetadata

---------

Co-authored-by: sivabalan <n.siva.b@gmail.com>

github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Mar 1, 2026

nsivabalan reviewed Mar 3, 2026

suryaprasanna added 4 commits March 22, 2026 14:06

Avoid recursive calls for fetching partitions by using CatalogBackedT…

1842a79

…ableMetadata when metadata is disabled or corrupted Create some unit tests unit tests wip

Remove unnecessary change

070b71e

WIP

4e580e7

Address review comments

5e9dba1

suryaprasanna force-pushed the use-catalog-for-partition-listing-v2 branch from 9c580f6 to 5e9dba1 Compare March 22, 2026 23:49

suryaprasanna and others added 2 commits March 22, 2026 17:45

Address review comments

7d57c12

Addressing feedback and fixing tests

25934b3

Fixing build failure

4fe8df3

nsivabalan approved these changes Apr 3, 2026

yihua previously approved these changes Apr 3, 2026

yihua reviewed Apr 5, 2026

nsivabalan merged commit 3c53b91 into apache:master Apr 6, 2026
56 checks passed


5 participants