HIVE-28676: Support custom partition patterns in MSCK repair table#6271
HIVE-28676: Support custom partition patterns in MSCK repair table#6271VenkatSNarayanan wants to merge 2 commits intoapache:masterfrom
Conversation
Adds support for custom partitioning patterns to MSCK repair table.
|
|
@deniskuzZ Addressed all the review comments from the old PR. The hcat.dynamic.partitioning.custom.pattern setting needs to be used in hive-standalone-metastore-server in one place, so I moved the string with the name of the setting there(hcatalog-core and hive-exec depend on that module). I've rebased it on current master and rerun the MSCK unit tests, which all pass. |
|
hi @VenkatSNarayanan, could you please share old PR to get the context. #5591? |
|
Old PR: #5591 |
There was a problem hiding this comment.
Pull request overview
This pull request adds support for custom partitioning patterns to the MSCK (Metastore Check) repair table functionality in Hive. Previously, only HCatStorer supported custom partitioning patterns for dynamic partitioning, but the MSCK repair table command would fail when encountering non-standard partition directory layouts. This change enables MSCK to recognize and extract partition values from custom path patterns like const/${state}/const2/${id}-${city} instead of only the standard key=value format.
Changes:
- Introduced
DynamicPartitioningCustomPatternclass to parse custom partition patterns and generate regex matchers - Modified
MetaStoreServerUtils.getPartitionName()to support custom pattern-based partition name extraction - Updated
HiveMetaStoreCheckerto validate paths against custom patterns during directory traversal - Refactored the
HCAT_CUSTOM_DYNAMIC_PATTERNconstant fromHCatConstantstoMetaStoreServerUtilsfor broader accessibility - Enhanced
AlterTableAddPartitionDescand related classes to propagate custom patterns through the partition creation pipeline - Added test coverage in
TestHiveMetaStoreCheckerfor custom partition pattern validation
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| DynamicPartitioningCustomPattern.java | New class that parses custom patterns, extracts partition column names, and builds regex matchers for path validation |
| MetaStoreServerUtils.java | Added custom pattern handling to getPartitionName() method and defined the HCAT_CUSTOM_DYNAMIC_PATTERN constant |
| HiveMetaStoreChecker.java | Modified PathDepthInfoCallable to support custom pattern depth calculation and path validation |
| TestHiveMetaStoreChecker.java | Added test case for custom partition pattern with non-standard directory structure |
| Partition.java | Enhanced createMetaPartitionObject() to construct partition paths from custom patterns |
| AlterTableAddPartitionDesc.java | Added custom pattern field and constructor overload for backward compatibility |
| AlterTableAddPartitionOperation.java | Updated to pass custom pattern to partition creation |
| AbstractAddPartitionAnalyzer.java | Modified to extract and propagate custom pattern from configuration |
| ImportSemanticAnalyzer.java | Updated to include custom pattern when creating partition descriptors |
| FSTableEvent.java | Modified replication flow to propagate custom patterns |
| HCatOutputFormat.java | Updated import to use new constant location |
| HCatFileUtil.java | Updated to import shared customPathPattern from DynamicPartitioningCustomPattern |
| HCatConstants.java | Removed HCAT_DYNAMIC_CUSTOM_PATTERN constant (moved to MetaStoreServerUtils) |
| TestHCatDynamicPartitioned.java | Updated to use new constant location |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| boolean didMatch = pathMatcher.matches(); | ||
| if (!didMatch) { //partition path doesn't match the pattern, should have been detected at an earlier step | ||
| throw new MetastoreException("Path " + relPath + "doesn't match custom partition pattern " | ||
| + customPathPattern + "partitionPathFull: " + partitionPath); |
There was a problem hiding this comment.
The error message is missing proper spacing. It should include a space before "partitionPathFull:" for better readability.
| + customPathPattern + "partitionPathFull: " + partitionPath); | |
| + customPathPattern + " partitionPathFull: " + partitionPath); |
| if (!patternPartCols.isEmpty()) { | ||
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | ||
| } | ||
| for (int i = 1; i < patternPartCols.size(); i++) { | ||
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i+1)); |
There was a problem hiding this comment.
The partition name is constructed using the order of columns as they appear in the custom pattern, but Hive expects partition names to follow the order defined in the table's partition columns. For example, if the table has partition columns (partcity, partstate, partid) but the custom pattern is "const/${partstate}/const2/${partid}-${partcity}", the code creates "partstate=CO/partid=1/partcity=denver" instead of the required "partcity=denver/partstate=CO/partid=1". The extracted values should be reordered to match the table's partition column order before constructing the partition name.
| if (!patternPartCols.isEmpty()) { | |
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | |
| } | |
| for (int i = 1; i < patternPartCols.size(); i++) { | |
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i+1)); | |
| // Map captured values by column name as they appear in the pattern. | |
| Map<String, String> capturedValues = new HashMap<>(); | |
| for (int i = 0; i < patternPartCols.size(); i++) { | |
| String colName = patternPartCols.get(i); | |
| String value = pathMatcher.group(i + 1); | |
| capturedValues.put(colName, value); | |
| } | |
| // Rebuild the partition name in the order of the table's partition columns. | |
| for (String colName : partCols) { | |
| String value = capturedValues.get(colName); | |
| if (value == null) { | |
| continue; | |
| } | |
| if (result == null) { | |
| result = new StringBuilder(colName + "=" + value); | |
| } else { | |
| result.append(Path.SEPARATOR).append(colName).append("=").append(value); | |
| } |
| StringBuffer sb = new StringBuffer(); | ||
| Matcher m = customPathPattern.matcher(customPattern); | ||
| List<String> partColumns = new ArrayList<>(); | ||
| while (m.find()) { | ||
| m.appendReplacement(sb, "(.*)"); | ||
| partColumns.add(m.group(2)); | ||
| } | ||
| m.appendTail(sb); | ||
| return new DynamicPartitioningCustomPattern(customPattern, Pattern.compile(sb.toString()), partColumns); |
There was a problem hiding this comment.
The custom pattern is compiled as a regex without escaping special regex characters. If the pattern contains characters like ., *, +, ?, etc., they will be interpreted as regex metacharacters rather than literals. For example, const./${partstate} would match constX/CO where X is any character, not just const./CO. Consider using Pattern.quote() to escape the literal portions of the pattern between the ${} placeholders, or document that the pattern supports regex metacharacters.
| StringBuffer sb = new StringBuffer(); | |
| Matcher m = customPathPattern.matcher(customPattern); | |
| List<String> partColumns = new ArrayList<>(); | |
| while (m.find()) { | |
| m.appendReplacement(sb, "(.*)"); | |
| partColumns.add(m.group(2)); | |
| } | |
| m.appendTail(sb); | |
| return new DynamicPartitioningCustomPattern(customPattern, Pattern.compile(sb.toString()), partColumns); | |
| StringBuilder sb = new StringBuilder(); | |
| Matcher m = customPathPattern.matcher(customPattern); | |
| List<String> partColumns = new ArrayList<>(); | |
| int lastEnd = 0; | |
| while (m.find()) { | |
| // Append the literal text before this placeholder, escaped so regex metacharacters are treated literally. | |
| if (m.start() > lastEnd) { | |
| sb.append(Pattern.quote(customPattern.substring(lastEnd, m.start()))); | |
| } | |
| // Replace the placeholder itself with a capturing group to extract the partition value. | |
| sb.append("(.*)"); | |
| partColumns.add(m.group(2)); | |
| lastEnd = m.end(); | |
| } | |
| // Append any remaining literal text after the last placeholder. | |
| if (lastEnd < customPattern.length()) { | |
| sb.append(Pattern.quote(customPattern.substring(lastEnd))); | |
| } | |
| Pattern partitionCapturePattern = Pattern.compile(sb.toString()); | |
| return new DynamicPartitioningCustomPattern(customPattern, partitionCapturePattern, partColumns); |
| if (!patternPartCols.isEmpty()) { | ||
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | ||
| } | ||
| for (int i = 1; i < patternPartCols.size(); i++) { | ||
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i+1)); |
There was a problem hiding this comment.
If the custom pattern contains no partition column placeholders (empty patternPartCols), the result will be null, which is returned at line 1676. However, this case should be validated earlier and an appropriate error should be thrown, as a partition pattern without any partition columns doesn't make sense. Consider adding validation in the DynamicPartitioningCustomPattern.Builder.build() method to ensure at least one partition column is present.
| if (!patternPartCols.isEmpty()) { | |
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | |
| } | |
| for (int i = 1; i < patternPartCols.size(); i++) { | |
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i+1)); | |
| if (patternPartCols.isEmpty()) { | |
| throw new MetastoreException("Custom dynamic partition pattern must contain at least one partition column: " | |
| + customPattern); | |
| } | |
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | |
| for (int i = 1; i < patternPartCols.size(); i++) { | |
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i + 1)); |
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | ||
| } | ||
| for (int i = 1; i < patternPartCols.size(); i++) { | ||
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i+1)); |
There was a problem hiding this comment.
The custom pattern path does not call getNormalisedPartitionValue for the extracted partition values, unlike the standard path (lines 1660-1661). This means numeric partition values won't be normalized (e.g., "01" vs "1") when using custom patterns, leading to inconsistent behavior. The extracted partition values should be normalized using getNormalisedPartitionValue(pathMatcher.group(i+1), partitionColToTypeMap.get(patternPartCols.get(i)), conf) before being appended to the result.
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | |
| } | |
| for (int i = 1; i < patternPartCols.size(); i++) { | |
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i+1)); | |
| String firstColName = patternPartCols.get(0); | |
| String firstRawValue = pathMatcher.group(1); | |
| String firstNormalisedValue = getNormalisedPartitionValue( | |
| firstRawValue, partitionColToTypeMap.get(firstColName), conf); | |
| if (firstNormalisedValue == null) { | |
| return null; | |
| } | |
| result = new StringBuilder(firstColName + "=" + firstNormalisedValue); | |
| } | |
| for (int i = 1; i < patternPartCols.size(); i++) { | |
| String colName = patternPartCols.get(i); | |
| String rawValue = pathMatcher.group(i + 1); | |
| String normalisedValue = getNormalisedPartitionValue( | |
| rawValue, partitionColToTypeMap.get(colName), conf); | |
| if (normalisedValue == null) { | |
| return null; | |
| } | |
| result.append(Path.SEPARATOR).append(colName).append("=").append(normalisedValue); |
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | ||
| } | ||
| for (int i = 1; i < patternPartCols.size(); i++) { | ||
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i+1)); |
There was a problem hiding this comment.
Partition column names extracted from the custom pattern are not converted to lowercase, unlike the standard path (line 1656). Hive stores partition keys in lowercase, so the extracted column names from patternPartCols.get(i) should be lowercased before being used in the result to maintain consistency with the standard partition name handling.
| result = new StringBuilder(patternPartCols.get(0) + "=" + pathMatcher.group(1)); | |
| } | |
| for (int i = 1; i < patternPartCols.size(); i++) { | |
| result.append(Path.SEPARATOR).append(patternPartCols.get(i)).append("=").append(pathMatcher.group(i+1)); | |
| String partitionName = patternPartCols.get(0).toLowerCase(); | |
| result = new StringBuilder(partitionName + "=" + pathMatcher.group(1)); | |
| } | |
| for (int i = 1; i < patternPartCols.size(); i++) { | |
| String partitionName = patternPartCols.get(i).toLowerCase(); | |
| result.append(Path.SEPARATOR).append(partitionName).append("=").append(pathMatcher.group(i + 1)); |
| Matcher m = customPathPattern.matcher(customPattern); | ||
| List<String> partColumns = new ArrayList<>(); | ||
| while (m.find()) { | ||
| m.appendReplacement(sb, "(.*)"); |
There was a problem hiding this comment.
The greedy regex pattern (.*) can lead to performance issues and unexpected matching behavior. Consider using a non-greedy pattern (.*?) or, better yet, ([^/]+) to match partition values that don't contain path separators. The greedy pattern requires excessive backtracking when the regex engine tries to match subsequent literals in the pattern, and could incorrectly match values that span multiple path segments.
| m.appendReplacement(sb, "(.*)"); | |
| m.appendReplacement(sb, "([^/]+)"); |
| Matcher m = customPathPattern.matcher(customPattern); | ||
| StringBuffer sb = new StringBuffer(); | ||
| while (m.find()) { | ||
| m.appendReplacement(sb, partSpec.get(m.group(2))); |
There was a problem hiding this comment.
The code partSpec.get(m.group(2)) can return null if the partition spec doesn't contain a value for the column name extracted from the pattern. This would result in appendReplacement inserting "null" as a string into the path. Add a null check and throw an appropriate exception if a required partition column is missing from the spec.
| m.appendReplacement(sb, partSpec.get(m.group(2))); | |
| String partColName = m.group(2); | |
| String partValue = partSpec.get(partColName); | |
| if (partValue == null) { | |
| throw new HiveException("partition spec is invalid; required partition column '" | |
| + partColName + "' is missing for custom pattern: " + customPattern); | |
| } | |
| m.appendReplacement(sb, partValue); |
| if (customPattern != null) { | ||
| DynamicPartitioningCustomPattern compiledCustomPattern = new DynamicPartitioningCustomPattern.Builder() | ||
| .setCustomPattern(customPattern) | ||
| .build(); | ||
| Pattern customPathPattern = compiledCustomPattern.getPartitionCapturePattern(); | ||
| //partition columns in order that they appear in the pattern | ||
| List<String> patternPartCols = compiledCustomPattern.getPartitionColumns(); |
There was a problem hiding this comment.
There is no validation that the partition columns extracted from the custom pattern match the table's partition columns (provided in partCols). The code should verify that: (1) all columns in patternPartCols exist in the table's partition columns, and (2) the number of columns matches. Without this validation, the code could create invalid partition names with columns that don't exist in the table definition or with mismatched column counts.
| String relPath = String.join(Path.SEPARATOR, | ||
| Arrays.copyOfRange(parts, parts.length - maxDepth, parts.length)); | ||
| if (!fullPattern.matcher(relPath).matches()) { | ||
| logOrThrowExceptionWithMsg("File path " + currentPath + " does not match custom partitioning pattern" |
There was a problem hiding this comment.
The error message is missing a space between the pattern and "partitionPathFull". It should be "does not match custom partitioning pattern " instead of "does not match custom partitioning pattern" to improve readability.
| logOrThrowExceptionWithMsg("File path " + currentPath + " does not match custom partitioning pattern" | |
| logOrThrowExceptionWithMsg("File path " + currentPath + " does not match custom partitioning pattern " |



Adds support for custom partitioning patterns to MSCK repair table.
What changes were proposed in this pull request?
Adds support for custom partitioning patterns to MSCK repair table.
Why are the changes needed?
HCatStorer supports custom partitioning patterns when using dynamic partitioning, but Hive itself does not support this. This change adds support for non-pathological cases to Hive.
Does this PR introduce any user-facing change?
MSCK repair table with a configured custom partition pattern would previously ignore that pattern and error on finding nonstandard paths. With the code from this PR, it will respect the defined custom pattern when it extracts partition key values from the paths.
Is the change a dependency upgrade?
No
How was this patch tested?
A test was added to TestHiveMetastoreChecker to test the common kinds of custom patterns supported.
Adds support for custom partitioning patterns to MSCK repair table.