[SPARK-57162][SQL] Add nanosecond-aware TimestampFormatter for parsing and formatting TimestampNanosVal#56295
[SPARK-57162][SQL] Add nanosecond-aware TimestampFormatter for parsing and formatting TimestampNanosVal#56295MaxGekk wants to merge 11 commits into
Conversation
…g and formatting TimestampNanosVal ### What changes were proposed in this pull request? Extend the `TimestampFormatter` family with additive, nanosecond-aware parse and format methods that produce and consume `org.apache.spark.unsafe.types.TimestampNanosVal` (`epochMicros: Long` + `nanosWithinMicro: Short` in `[0, 999]`) at a target fractional precision `p` in `[7, 9]`: - New trait methods: `parseNanos` / `parseNanosOptional` (LTZ), `parseWithoutTimeZoneNanos` / `parseWithoutTimeZoneNanosOptional` (NTZ, plus a `final` `allowTimeZone = true` overload), and `formatNanos`. - `Iso8601TimestampFormatter`: `extractNanos` / `extractNanosNTZ` build the `Instant` / `LocalDateTime` and delegate to `SparkDateTimeUtils.instantToTimestampNanos` / `localDateTimeToTimestampNanos`; `formatNanos` floors sub-`precision` digits and renders the reconstructed instant. - `DefaultTimestampFormatter`: delegates to the SPARK-57032 nanos entry points. - `LegacyFastTimestampFormatter` / `LegacySimpleTimestampFormatter`: explicitly reject nanosecond precision under the `LEGACY` time parser policy (they cap at micro resolution). Sub-precision fractional digits are truncated (floored), consistent with SPARK-57032. All existing microsecond methods are unchanged (additive API). ### Why are the changes needed? Today `TimestampFormatter` is microsecond-only and discards the 7th-9th fractional digits. The JSON and CSV datasources drive all timestamp parsing/formatting through `TimestampFormatter`, so they cannot round-trip 7-9 digit fractions until the formatter is nanos-aware. This is the foundational unblocker for nanosecond support in those datasources (parent: SPARK-56822). ### Does this PR introduce _any_ user-facing change? No. The new formatter API is additive and gated for use behind `spark.sql.timestampNanosTypes.enabled` by its callers. ### How was this patch tested? New cases in `TimestampFormatterSuite`: parse/format round-trip for `p` in `[7, 9]` across ISO default and custom patterns (LTZ and NTZ); boundary values (`nanosWithinMicro` 0 and 999, pre-epoch instants, the 0001/1582/1970/9999 corpus); truncation rule; NTZ time-zone rejection; and LEGACY-mode rejection. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor (Claude Opus 4.8)
…pFormatter Drop the escaped leading dash from @param descriptions, capitalize the first letter, and inline each description on the same `@param` line where it fits.
… the LEGACY parser Replace the INTERNAL_ERROR thrown by the legacy formatters' parseNanos/formatNanos with a user-facing UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER error. The LEGACY time parser policy is user-configurable, so a caller may legitimately combine it with nanosecond-precision timestamps; that should surface as a proper unsupported-feature error suggesting CORRECTED, not an internal error.
…tters to throw the user-facing error Co-authored-by: Max Gekk <max.gekk@gmail.com>
…its no fractional digits Co-authored-by: Max Gekk <max.gekk@gmail.com>
…tTimeZoneNanos Co-authored-by: Max Gekk <max.gekk@gmail.com>
|
@stevomitric Could you review this PR, please. |
…ormatter Reverts the @param reflow that reformatted pre-existing methods' scaladoc, which added unrelated diff noise. Restores the original dashed multi-line @param style consistent with the rest of the file.
…ampFormatter formatNanos always routes through format(Instant) and applies the formatter's zoneId, making it LTZ-only; an NTZ value (a UTC-grid wall clock) rendered under a non-UTC zone came out shifted. Add an NTZ counterpart formatWithoutTimeZoneNanos that mirrors the microsecond format(LocalDateTime) path: it rebuilds the zone-independent local date-time via timestampNanosToLocalDateTime, floors sub-precision digits, and renders with the pattern only (no zoneId). The legacy formatters reject it with the same UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER error, and formatNanos is documented as LTZ-only. Adds a +01:00 regression test that the prior all-UTC NTZ cases could not catch.
…nverters instantToTimestampNanos / localDateTimeToTimestampNanos truncated sub-micro digits via truncateNanosWithinMicroToPrecision, which silently passed through any precision outside [7, 9]. The DefaultTimestampFormatter parse path already rejects out-of-range precision (the string helpers throw), but the Iso8601 path did not, so the contract was enforced inconsistently across subclasses. Make the shared truncation helper raise SparkException.internalError for precision outside [7, 9] so both the NTZ (LocalDateTime) and LTZ (Instant) converters - and thus both formatter subclasses, in parse and format directions - enforce the bound the same way. The check is an internal invariant, not a user-facing validation: precision always originates from a validated TimestampNTZNanosType / TimestampLTZNanosType (constructible only with p in [7, 9]), so it is unreachable from public APIs.
stevomitric
left a comment
There was a problem hiding this comment.
LGTM, thanks for resolving comments.
|
@uros-b Could you review this PR, please. We need this to support casting timestamps to nanos, in literals, text datasource, hive results, in some expressions and so on. |
| /** | ||
| * Optional counterpart of [[parseNanos]]. The result is `None` on invalid input. | ||
| */ | ||
| def parseNanosOptional(s: String, precision: Int): Option[TimestampNanosVal] = | ||
| try { | ||
| Some(parseNanos(s, precision)) | ||
| } catch { | ||
| case _: Exception => None | ||
| } |
There was a problem hiding this comment.
This will swallow the user-facing error on legacy formatters. Do we really want silent nulls instead of errors in such scenarios?
There was a problem hiding this comment.
Good catch - this is a real inconsistency. The legacy formatters override the strict parseNanos / parseWithoutTimeZoneNanos to throw the user-facing UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER, but the *Optional variants inherited the trait default, which catches case _: Exception => None. Since legacyNanosUnsupported() is a (non-fatal) SparkUnsupportedOperationException, the optional path swallowed it and returned None - so under the LEGACY policy the strict path errored while the optional path silently yielded null. That conflates genuinely invalid input (legitimately None) with an unsupported config combination (LEGACY + nanos), which should surface loudly so the user knows to switch to CORRECTED.
Rather than re-add a throwing override to every formatter and rely on remembering it for future ones, I removed the defaults entirely: parseNanosOptional and parseWithoutTimeZoneNanosOptional are now abstract in the (sealed) TimestampFormatter trait, so every implementer must consciously decide. The legacy formatters implement them by throwing legacyNanosUnsupported(); the Default / Iso8601 formatters already implement them with the None-on-invalid-input behavior (their parseNanos doesn't throw the unsupported error, so that's correct there). Because the trait is sealed, the compiler guarantees completeness file-locally with no external burden. Extended the LEGACY-policy test to assert both optional variants raise the error instead of returning None.
uros-b
left a comment
There was a problem hiding this comment.
Left one comment, otherwise LGTM!
…acy formatters surface the unsupported error The legacy formatters override the strict `parseNanos` / `parseWithoutTimeZoneNanos` to throw the user-facing UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_WITH_LEGACY_TIME_PARSER error, but inherited the trait default for `parseNanosOptional` / `parseWithoutTimeZoneNanosOptional`, which catches `case _: Exception => None`. Since `legacyNanosUnsupported()` is a non-fatal SparkUnsupportedOperationException, the optional path swallowed it and returned None - so under the LEGACY time parser policy the strict path errored while the optional path silently yielded null, conflating invalid input with an unsupported config combination. Remove the swallowing defaults: make `parseNanosOptional` and `parseWithoutTimeZoneNanosOptional` abstract in the sealed `TimestampFormatter` trait so every implementer must decide explicitly. The legacy formatters implement them by throwing `legacyNanosUnsupported()`; the Default/Iso8601 formatters already implement the None-on-invalid-input behavior. Extend the LEGACY-policy test to assert both optional variants raise the error instead of returning None.
What changes were proposed in this pull request?
Extend the
TimestampFormatterfamily with additive, nanosecond-aware parse and format methods that produce and consumeorg.apache.spark.unsafe.types.TimestampNanosVal(epochMicros: Long+nanosWithinMicro: Shortin[0, 999]) at a target fractional precisionpin[7, 9]:parseNanos/parseNanosOptional(LTZ),parseWithoutTimeZoneNanos/parseWithoutTimeZoneNanosOptional(NTZ, plus afinalallowTimeZone = trueoverload), andformatNanos.Iso8601TimestampFormatter:extractNanos/extractNanosNTZbuild theInstant/LocalDateTimeand delegate toSparkDateTimeUtils.instantToTimestampNanos/localDateTimeToTimestampNanos;formatNanosfloors sub-precisiondigits and renders the reconstructed instant.DefaultTimestampFormatter: delegates to the SPARK-57032 nanos entry points.LegacyFastTimestampFormatter/LegacySimpleTimestampFormatter: explicitly reject nanosecond precision under theLEGACYtime parser policy (they cap at micro resolution).Sub-precision fractional digits are truncated (floored), consistent with SPARK-57032. All existing microsecond methods are unchanged (additive API).
Why are the changes needed?
Today
TimestampFormatteris microsecond-only and discards the 7th-9th fractional digits. The JSON and CSV datasources drive all timestamp parsing/formatting throughTimestampFormatter, so they cannot round-trip 7-9 digit fractions until the formatter is nanos-aware. This is the foundational unblocker for nanosecond support in those datasources (parent: SPARK-56822).Does this PR introduce any user-facing change?
No. The new formatter API is additive and gated for use behind
spark.sql.timestampNanosTypes.enabledby its callers.How was this patch tested?
New cases in
TimestampFormatterSuite: parse/format round-trip forpin[7, 9]across ISO default and custom patterns (LTZ and NTZ); boundary values (nanosWithinMicro0 and 999, pre-epoch instants, the 0001/1582/1970/9999 corpus); truncation rule; NTZ time-zone rejection; and LEGACY-mode rejection.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)