Skip to content

perf(metadata): avoid recursive calls for partition listing using catalog#18265

Merged
nsivabalan merged 7 commits intoapache:masterfrom
suryaprasanna:use-catalog-for-partition-listing-v2
Apr 6, 2026
Merged

perf(metadata): avoid recursive calls for partition listing using catalog#18265
nsivabalan merged 7 commits intoapache:masterfrom
suryaprasanna:use-catalog-for-partition-listing-v2

Conversation

@suryaprasanna
Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

When metadata table is disabled or corrupted, partition listing operations can result in expensive recursive filesystem queries. This PR introduces a catalog-backed approach to fetch partition information directly from the Spark external catalog, avoiding recursive calls and improving query performance.

Summary and Changelog

Users gain improved performance for partition listing operations when metadata table is unavailable. The change introduces:

  • Added CatalogBackedTableMetadata class that fetches partitions from Spark's external catalog
  • Added FILE_INDEX_PARTITION_LISTING_VIA_CATALOG config to enable catalog-based partition listing
  • Modified SparkHoodieTableFileIndex to use catalog-backed metadata when metadata table is not available
  • Added PartitionPathFilterUtil for partition path filtering logic
  • Refactored BaseHoodieTableFileIndex.createMetadataTable() to be overridable
  • Added comprehensive unit tests in TestCatalogBackedTableMetadata

Impact

  • Performance: Reduced latency for partition listing when metadata table is disabled by avoiding recursive filesystem
    queries
  • API Change: Added new config option FILE_INDEX_PARTITION_LISTING_VIA_CATALOG (default: false)
  • Behavior: When enabled and metadata table is unavailable, partitions are fetched from catalog instead of filesystem

Risk Level

Low - Feature is behind a config flag (disabled by default). Extensive unit tests verify catalog-based partition listing behavior. Fallback to existing filesystem-based approach when config is disabled.

Documentation Update

Config documentation needs to be updated to include the new FILE_INDEX_PARTITION_LISTING_VIA_CATALOG option describing when to enable catalog-based partition listing for performance optimization

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Mar 1, 2026
@suryaprasanna suryaprasanna force-pushed the use-catalog-for-partition-listing-v2 branch from 9c580f6 to 5e9dba1 Compare March 22, 2026 23:49
@nsivabalan
Copy link
Copy Markdown
Contributor

Pushed a commit to address minor feedback and fix tests.

@hudi-bot
Copy link
Copy Markdown
Collaborator

hudi-bot commented Apr 3, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 69.62025% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.75%. Comparing base (3aef2ca) to head (4fe8df3).
⚠️ Report is 43 commits behind head on master.

Files with missing lines Patch % Lines
...che/hudi/metadata/CatalogBackedTableMetadata.scala 52.27% 12 Missing and 9 partials ⚠️
...la/org/apache/hudi/SparkHoodieTableFileIndex.scala 85.71% 0 Missing and 2 partials ⚠️
.../org/apache/hudi/util/PartitionPathFilterUtil.java 75.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18265      +/-   ##
============================================
+ Coverage     68.46%   68.75%   +0.29%     
- Complexity    27472    28080     +608     
============================================
  Files          2427     2448      +21     
  Lines        132655   134528    +1873     
  Branches      15994    16268     +274     
============================================
+ Hits          90819    92494    +1675     
+ Misses        34786    34734      -52     
- Partials       7050     7300     +250     
Flag Coverage Δ
common-and-other-modules 44.49% <24.35%> (+0.05%) ⬆️
hadoop-mr-java-client 44.85% <45.45%> (-0.15%) ⬇️
spark-client-hadoop-common 48.49% <63.63%> (+0.28%) ⬆️
spark-java-tests 48.76% <30.37%> (-0.07%) ⬇️
spark-scala-tests 45.62% <67.08%> (+0.66%) ⬆️
utilities 38.34% <21.79%> (-0.29%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...java/org/apache/hudi/BaseHoodieTableFileIndex.java 83.41% <100.00%> (-0.08%) ⬇️
...e/hudi/metadata/FileSystemBackedTableMetadata.java 85.15% <100.00%> (+0.73%) ⬆️
...pache/hudi/metadata/HoodieBackedTableMetadata.java 82.56% <100.00%> (-0.23%) ⬇️
.../org/apache/hudi/metadata/HoodieTableMetadata.java 76.19% <100.00%> (-2.76%) ⬇️
...main/scala/org/apache/hudi/DataSourceOptions.scala 95.49% <100.00%> (+0.09%) ⬆️
...c/main/scala/org/apache/hudi/HoodieFileIndex.scala 83.08% <100.00%> (+0.32%) ⬆️
.../org/apache/hudi/util/PartitionPathFilterUtil.java 75.00% <75.00%> (ø)
...la/org/apache/hudi/SparkHoodieTableFileIndex.scala 72.48% <85.71%> (+0.21%) ⬆️
...che/hudi/metadata/CatalogBackedTableMetadata.scala 52.27% <52.27%> (ø)

... and 162 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

yihua
yihua previously approved these changes Apr 3, 2026
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@yihua yihua dismissed their stale review April 5, 2026 04:37

Accidentally approved

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — One minor formatting issue with method signature indentation in CatalogBackedTableMetadata.scala; otherwise code is clean and readable.

catalogTable.partitionColumnNames.nonEmpty
}

private def shouldUseCatalogPartitions: Boolean = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: consider moving the return type annotation to the same line as the method signature for idiomatic Scala style (e.g., override def getAllPartitionPaths: util.List[String] = or with parentheses if preferred).

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The catalog-backed partition listing approach is a smart optimization that avoids expensive recursive filesystem queries. However, there are a few issues to address before merging: the overridable createMetadataTable() call during base-class construction risks NPEs in subclasses, the catalog lookup needs graceful fallback when a table isn't registered, and the List<Object> partition predicates should use a properly typed interface.

return metaClient.getTableFormat().getMetadataFactory()
.create(engineContext, metaClient.getStorage(), metadataConfig, basePath.toString(), true);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Making createMetadataTable protected (overridable) while it's called from doRefresh() at line 201 during the base class constructor is risky. If a subclass overrides this method and accesses subclass-specific fields (e.g., catalog reference, spark session), those fields won't be initialized yet when the base constructor runs. Could you consider using lazy initialization or a post-construction init() method instead?

Types.RecordType partitionFields,
Expression expression) throws IOException;

default List<String> getPartitionPathWithPathPrefixUsingFilterExpression(List<String> relativePathPrefixes,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Using List<Object> for partitionPredicateExpressions loses all type safety. Callers will need to cast blindly, and any mismatch will only surface as a runtime ClassCastException. Could this be a generic type parameter on the interface, or would a more specific type (even a simple wrapper) work here to avoid the raw Object list?


private static final int DEFAULT_LISTING_PARALLELISM = 1500;

@Getter
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Is getDatabaseName() guaranteed to be non-null here? For tables created without a database name in the table config, this would store null. If the catalog-backed path later uses this database name to look up partitions, that could fail. Have you verified the behavior for tables without an explicit database name?

}
private lazy val tableIdentifier = TableIdentifier(catalogTableName, Some(catalogDatabaseName))
private lazy val catalogTable = sparkSession.sessionState.catalog.getTableMetadata(tableIdentifier)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 If the table isn't registered in the catalog (e.g., created via DataSource API with just a path), getTableMetadata will throw NoSuchTableException. Since catalogTable is accessed by isPartitionedTable and shouldUseCatalogPartitions in every partition listing method, the exception propagates up without any fallback to super (filesystem listing). Could you wrap the catalog access in a Try and fall back to super when the table isn't found?

private val sparkSession = engineContext.asInstanceOf[HoodieSparkEngineContext].getSqlContext.sparkSession
private val catalogTableName = tableConfig.getTableName
private lazy val catalogDatabaseName =
if (StringUtils.isNullOrEmpty(tableConfig.getDatabaseName)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 When getDatabaseName returns null/empty, this falls back to getCurrentDatabase, which reflects the session's current database context — not necessarily where the table is registered. If a user runs USE some_other_db before querying, this would look up the wrong database and either throw NoSuchTableException or find a different table with the same name. Would it be safer to resolve the database from the catalog table's metadata (e.g., via the table's location path) or at least document this assumption?

if (!isPartitionedTable) {
util.Collections.emptyList()
} else if (shouldUseCatalogPartitions) {
val partitionPredicateExpressionSeq = partitionPredicateExpressions.asScala.map(_.asInstanceOf[Expression]).toSeq
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The else branch falls back to the 3-param super call (without partitionPredicateExpressions), which means the Spark partition predicate expressions are silently dropped when the catalog isn't available. Is this intentional? In that case the filesystem listing only uses the Hudi pushedExpr for filtering, which should still be correct, but I wanted to confirm no filter information is lost here.

@nsivabalan nsivabalan merged commit 3c53b91 into apache:master Apr 6, 2026
56 checks passed
rahil-c pushed a commit to rahil-c/hudi that referenced this pull request Apr 20, 2026
…alog (apache#18265)

When metadata table is disabled or corrupted, partition listing operations can result in expensive recursive filesystem queries. This PR introduces a catalog-backed approach to fetch partition information directly from the Spark external catalog, avoiding recursive calls and improving query performance.

Summary and Changelog
Users gain improved performance for partition listing operations when metadata table is unavailable. The change introduces:

Added CatalogBackedTableMetadata class that fetches partitions from Spark's external catalog
Added "hoodie.datasource.read.file.index.list.partitions.from.catalog" config to enable catalog-based partition listing
Modified SparkHoodieTableFileIndex to use catalog-backed metadata when metadata table is not available
Added PartitionPathFilterUtil for partition path filtering logic
Refactored BaseHoodieTableFileIndex.createMetadataTable() to be overridable
Added comprehensive unit tests in TestCatalogBackedTableMetadata

---------

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants