perf(metadata): avoid recursive calls for partition listing using catalog#18265
Conversation
…ableMetadata when metadata is disabled or corrupted Create some unit tests unit tests wip
9c580f6 to
5e9dba1
Compare
|
Pushed a commit to address minor feedback and fix tests. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18265 +/- ##
============================================
+ Coverage 68.46% 68.75% +0.29%
- Complexity 27472 28080 +608
============================================
Files 2427 2448 +21
Lines 132655 134528 +1873
Branches 15994 16268 +274
============================================
+ Hits 90819 92494 +1675
+ Misses 34786 34734 -52
- Partials 7050 7300 +250
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — One minor formatting issue with method signature indentation in CatalogBackedTableMetadata.scala; otherwise code is clean and readable.
| catalogTable.partitionColumnNames.nonEmpty | ||
| } | ||
|
|
||
| private def shouldUseCatalogPartitions: Boolean = { |
There was a problem hiding this comment.
🤖 nit: consider moving the return type annotation to the same line as the method signature for idiomatic Scala style (e.g., override def getAllPartitionPaths: util.List[String] = or with parentheses if preferred).
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! The catalog-backed partition listing approach is a smart optimization that avoids expensive recursive filesystem queries. However, there are a few issues to address before merging: the overridable createMetadataTable() call during base-class construction risks NPEs in subclasses, the catalog lookup needs graceful fallback when a table isn't registered, and the List<Object> partition predicates should use a properly typed interface.
| return metaClient.getTableFormat().getMetadataFactory() | ||
| .create(engineContext, metaClient.getStorage(), metadataConfig, basePath.toString(), true); | ||
| } | ||
|
|
There was a problem hiding this comment.
🤖 Making createMetadataTable protected (overridable) while it's called from doRefresh() at line 201 during the base class constructor is risky. If a subclass overrides this method and accesses subclass-specific fields (e.g., catalog reference, spark session), those fields won't be initialized yet when the base constructor runs. Could you consider using lazy initialization or a post-construction init() method instead?
| Types.RecordType partitionFields, | ||
| Expression expression) throws IOException; | ||
|
|
||
| default List<String> getPartitionPathWithPathPrefixUsingFilterExpression(List<String> relativePathPrefixes, |
There was a problem hiding this comment.
🤖 Using List<Object> for partitionPredicateExpressions loses all type safety. Callers will need to cast blindly, and any mismatch will only surface as a runtime ClassCastException. Could this be a generic type parameter on the interface, or would a more specific type (even a simple wrapper) work here to avoid the raw Object list?
|
|
||
| private static final int DEFAULT_LISTING_PARALLELISM = 1500; | ||
|
|
||
| @Getter |
There was a problem hiding this comment.
🤖 Is getDatabaseName() guaranteed to be non-null here? For tables created without a database name in the table config, this would store null. If the catalog-backed path later uses this database name to look up partitions, that could fail. Have you verified the behavior for tables without an explicit database name?
| } | ||
| private lazy val tableIdentifier = TableIdentifier(catalogTableName, Some(catalogDatabaseName)) | ||
| private lazy val catalogTable = sparkSession.sessionState.catalog.getTableMetadata(tableIdentifier) | ||
|
|
There was a problem hiding this comment.
🤖 If the table isn't registered in the catalog (e.g., created via DataSource API with just a path), getTableMetadata will throw NoSuchTableException. Since catalogTable is accessed by isPartitionedTable and shouldUseCatalogPartitions in every partition listing method, the exception propagates up without any fallback to super (filesystem listing). Could you wrap the catalog access in a Try and fall back to super when the table isn't found?
| private val sparkSession = engineContext.asInstanceOf[HoodieSparkEngineContext].getSqlContext.sparkSession | ||
| private val catalogTableName = tableConfig.getTableName | ||
| private lazy val catalogDatabaseName = | ||
| if (StringUtils.isNullOrEmpty(tableConfig.getDatabaseName)) { |
There was a problem hiding this comment.
🤖 When getDatabaseName returns null/empty, this falls back to getCurrentDatabase, which reflects the session's current database context — not necessarily where the table is registered. If a user runs USE some_other_db before querying, this would look up the wrong database and either throw NoSuchTableException or find a different table with the same name. Would it be safer to resolve the database from the catalog table's metadata (e.g., via the table's location path) or at least document this assumption?
| if (!isPartitionedTable) { | ||
| util.Collections.emptyList() | ||
| } else if (shouldUseCatalogPartitions) { | ||
| val partitionPredicateExpressionSeq = partitionPredicateExpressions.asScala.map(_.asInstanceOf[Expression]).toSeq |
There was a problem hiding this comment.
🤖 The else branch falls back to the 3-param super call (without partitionPredicateExpressions), which means the Spark partition predicate expressions are silently dropped when the catalog isn't available. Is this intentional? In that case the filesystem listing only uses the Hudi pushedExpr for filtering, which should still be correct, but I wanted to confirm no filter information is lost here.
…alog (apache#18265) When metadata table is disabled or corrupted, partition listing operations can result in expensive recursive filesystem queries. This PR introduces a catalog-backed approach to fetch partition information directly from the Spark external catalog, avoiding recursive calls and improving query performance. Summary and Changelog Users gain improved performance for partition listing operations when metadata table is unavailable. The change introduces: Added CatalogBackedTableMetadata class that fetches partitions from Spark's external catalog Added "hoodie.datasource.read.file.index.list.partitions.from.catalog" config to enable catalog-based partition listing Modified SparkHoodieTableFileIndex to use catalog-backed metadata when metadata table is not available Added PartitionPathFilterUtil for partition path filtering logic Refactored BaseHoodieTableFileIndex.createMetadataTable() to be overridable Added comprehensive unit tests in TestCatalogBackedTableMetadata --------- Co-authored-by: sivabalan <n.siva.b@gmail.com>
Describe the issue this Pull Request addresses
When metadata table is disabled or corrupted, partition listing operations can result in expensive recursive filesystem queries. This PR introduces a catalog-backed approach to fetch partition information directly from the Spark external catalog, avoiding recursive calls and improving query performance.
Summary and Changelog
Users gain improved performance for partition listing operations when metadata table is unavailable. The change introduces:
Impact
queries
Risk Level
Low - Feature is behind a config flag (disabled by default). Extensive unit tests verify catalog-based partition listing behavior. Fallback to existing filesystem-based approach when config is disabled.
Documentation Update
Config documentation needs to be updated to include the new FILE_INDEX_PARTITION_LISTING_VIA_CATALOG option describing when to enable catalog-based partition listing for performance optimization
Contributor's checklist