[SPARK-57207][SQL] Support nanosecond timestamp types in the Types Framework#56266
[SPARK-57207][SQL] Support nanosecond timestamp types in the Types Framework#56266MaxGekk wants to merge 8 commits into
Conversation
|
Please, review this PR @davidm-db @stevomitric @dejankrak-db |
stevomitric
left a comment
There was a problem hiding this comment.
LGTM, left a few comments, please resolve.
|
A general note for reviewers: the dual logic (framework + legacy) is only present for the |
87cd8a6 to
d2f70e6
Compare
…ramework via TypeOps overrides Registers TimestampNTZNanosType and TimestampLTZNanosType in the Types Framework by adding TimestampNanosTypeOps (server-side, catalyst) and TimestampNanosTypeApiOps (client-side, sql/api) with their NTZ/LTZ concrete subclasses. These only override existing TypeOps/TypeApiOps methods (physical type, literals, row writer/accessor, mutable value, codegen class, formatting); no new framework methods are introduced. A dedicated MutableTimestampNanos holder is added to SpecificInternalRow to avoid the MutableAny fallback. Encoders remain out of scope (SPARK-57033), so getEncoder reports the type as unsupported to match the legacy RowEncoder behavior.
…tamp formatting Replace the interim TimestampNanosVal.toString passthrough in format() with SparkException.internalError so callers see a clear message instead of a debug string on the CAST-to-STRING path. Update the test to expect the error. Co-authored-by: Isaac
…ypeOps to reuse SPARK-57033 The Types Framework overrides for the nanosecond timestamp types short-circuit the legacy dispatch when spark.sql.types.framework.enabled is on. They were written against the pre-SPARK-57033 world, so enabling the framework regressed the encoder and java.time conversion support that PR apache#56158 already merged: - getEncoder threw UNSUPPORTED_DATA_TYPE_FOR_ENCODER instead of returning the LocalDateTimeNanosEncoder / InstantNanosEncoder added by SPARK-57033. - toCatalystImpl/toScala/toScalaImpl were identity passthroughs instead of converting java.time.LocalDateTime/Instant <-> TimestampNanosVal with per-precision truncation, as the legacy TimestampN(TZ|LTZ)NanosConverter does. Wire getEncoder (gated by spark.sql.timestampNanosTypes.enabled, mirroring the legacy RowEncoder path) to the precision-aware encoders, and route the catalyst conversions through the shared DateTimeUtils helpers (raising INVALID_EXTERNAL_VALUE on bad input). Add tests asserting the encoder/converter parity between the framework-on and framework-off paths and that the feature-flag gate still holds. Co-authored-by: Isaac
…ST-to-string behavior The Types Framework TypeApiOps.format for nanosecond timestamp types threw an internal "not yet implemented" error, which diverged from the framework-off path: ToStringBase renders the internal value via UTF8String.fromString(toString) (interpreted) / String.valueOf (codegen). That made CAST-to-string and display output differ depending on spark.sql.types.framework.enabled. Have format() mirror the legacy fallback by rendering the internal TimestampNanosVal via toString, so enabling the Types Framework does not change CAST-to-string / display output until dedicated fractional-second formatters land. Update the test to assert format/toSQLValue follow this behavior. Co-authored-by: Isaac
… Framework Route all nanosecond timestamp (TimestampNTZNanosType / TimestampLTZNanosType) handling through TypeOps / TypeApiOps and remove the legacy duplicate code paths, so these types are supported only when the Types Framework is enabled and are unsupported (no silent fallback) when it is disabled. - Move createSerializer / createDeserializer into TimestampNanosTypeOps and add a getBoxedJavaClass hook on TypeOps (overridden by TimeTypeOps) so EncoderUtils, Serializer/DeserializerBuildHelper, RowEncoder, CatalystTypeConverters, PhysicalDataType, Literal, CodeGenerator and InternalRow no longer special-case nanos types. - Make string formatting raise an internal error until a dedicated TimestampFormatter lands: TimestampNanosTypeApiOps.format / toSQLValue throw, and the codegen CAST-to-string path in ToStringBase throws the same error so interpreted and codegen behavior stay consistent. - Update the nanos suites to assert framework-only behavior and that CAST-to-string fails the same way in both eval modes.
…s timestamp types Validate via checkValue on spark.sql.timestampNanosTypes.enabled that it can only be set to true when spark.sql.types.framework.enabled is true, since the nanosecond timestamp types are implemented solely through the Types Framework. Switch the entry to createWithDefaultFunction so the validator runs only on explicitly-set values, not on the default during SQLConf initialization.
…mestamps as strings Converting a nanosecond timestamp value to a string (CAST to STRING, EXPLAIN/SHOW output, SQL-literal rendering) is reachable from public APIs once the preview flags are enabled, so the previous SparkException.internalError was inappropriate. Introduce the UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING error condition and raise it from both the interpreted (TimestampNanosTypeApiOps.format) and codegen (ToStringBase.castToStringCode) paths via a shared DataTypeErrors helper.
d2f70e6 to
9ac8e3a
Compare
davidm-db
left a comment
There was a problem hiding this comment.
Minor, on the description: the "What changes were proposed" section says this PR adds three hooks to the TypeOps base - getBoxedJavaClass, createSerializer, and createDeserializer. Only getBoxedJavaClass is actually new here; createSerializer and createDeserializer already exist on the base trait (added in #54905), and this PR overrides them on the nanos ops. Might be worth rewording so reviewers don't go looking for base-trait additions that aren't in the diff.
| @scala.annotation.tailrec | ||
| def javaBoxedType(dt: DataType): Class[_] = dt match { | ||
| def javaBoxedType(dt: DataType): Class[_] = | ||
| TypeOps(dt).map(_.getBoxedJavaClass).getOrElse(javaBoxedTypeDefault(dt)) |
There was a problem hiding this comment.
Question (non-blocking): is getBoxedJavaClass reachable for the framework types today? javaBoxedType has two callers - ValidateExternalType.checkType and MapObjects.elementClassTag - and for non-native encoders both receive the external ObjectType(...) (e.g. ObjectType(LocalTime) / ObjectType(LocalDateTime)), not the raw TimestampNTZNanosType / TimeType; MapObjects's boxed-element branch only fires for native catalyst element types. So TypeOps(dt).map(_.getBoxedJavaClass) appears to return Some only for types that never reach javaBoxedType with their literal form. If that's right, the hook plus this wiring is effectively defensive for now - which is fine and consistent with centralizing the dispatch (the explicit nanos arms it replaces were equally unreached) - I just want to confirm it's intentional rather than closing a live gap. The hook is still the right shape if it ever does become reachable, since getJavaClass returns the primitive class for Long-backed types.
There was a problem hiding this comment.
Claude flagged this, I wasn't able to disprove it - so this is more of a question for my knowledge...
There was a problem hiding this comment.
Your analysis is correct, and it's intentional/defensive rather than closing a live gap. Both callers of javaBoxedType (ValidateExternalType.checkType, MapObjects.elementClassTag) receive the external ObjectType(...) for non-native encoders, and MapObjects's boxed-element branch only fires for native catalyst element types - so TypeOps(dt).map(_.getBoxedJavaClass) returns Some only for dataTypes that don't currently reach javaBoxedType in their raw form. It directly replaces the explicit case _: TimestampNTZNanosType => classOf[TimestampNanosVal] arms that the old javaBoxedType had, which were equally unreached. I kept it because it centralizes the dispatch consistently with the rest of the framework and is the correct shape if a primitive-backed framework type ever does reach javaBoxedType (getBoxedJavaClass returns the boxed class, getJavaClass the primitive).
stevomitric
left a comment
There was a problem hiding this comment.
Wrapper encoders can be bypassed by the new TypeOps serializer/deserializer dispatch. In PR head, SerializerBuildHelper.createSerializer and DeserializerBuildHelper.createDeserializer call TypeOps(enc.dataType) before matching wrapper encoders. But OptionEncoder and TransformingEncoder proxy dataType to the wrapped encoder, so OptionEncoder(LocalDateTimeNanosEncoder(p)).dataType == TimestampNTZNanosType(p). That means the nanos TypeOps path can try to serialize an Option[LocalDateTime] as if it were a bare LocalDateTime, and deserialize to LocalDateTime without WrapOption. Fix by unwrapping OptionEncoder / TransformingEncoder before framework dispatch, and add tests for optional/nested nanos encoder roundtrips.
…rk serde dispatch SerializerBuildHelper.createSerializer and DeserializerBuildHelper.createDeserializer dispatch to TypeOps(enc.dataType) before matching wrapper encoders. OptionEncoder and TransformingEncoder proxy dataType to the wrapped encoder, so e.g. OptionEncoder(LocalDateTimeNanosEncoder(p)).dataType == TimestampNTZNanosType(p), which let the framework leaf serializer/deserializer fire on the wrapper and skip UnwrapOption/WrapOption (and the transforming codec). Route OptionEncoder and TransformingEncoder through the default path first so the framework leaf dispatch only sees unwrapped leaf encoders. Add a roundtrip test for Option-wrapped nanos encoders.
|
Thanks for the review @davidm-db @stevomitric. Addressing the two review-summary comments: @davidm-db (description wording). You're right - only @stevomitric (wrapper encoders bypassing the framework serde dispatch). Good catch. Since Fixed in 02db736: For context: this dispatch ordering predates this PR (the helpers have routed through |
What changes were proposed in this pull request?
This PR wires
TimestampNTZNanosType(p)andTimestampLTZNanosType(p)(p in [7, 9]) through the Spark SQL Types Framework (SPARK-53504), so that all type-specific behavior for the nanosecond timestamp types is centralized behindTypeOps/TypeApiOps. The nanos types are now supported only through the framework: the scattered legacy dispatch for them is removed.Concretely:
TimestampNanosTypeOps(catalyst) withTimestampNTZNanosTypeOps/TimestampLTZNanosTypeOps, registered inTypeOps.apply(). Overrides:getPhysicalType,getJavaClass,getBoxedJavaClass,getRowWriter,getDefaultLiteral,getJavaLiteral,getMutableValue,toCatalystImpl,toScala/toScalaImpl,createSerializer,createDeserializer.getBoxedJavaClasshook to theTypeOpsbase (the boxed Java class used in codegen). ThecreateSerializer/createDeserializerhooks already exist on the base trait (used byTimeTypeOps); the nanos ops above only override them.TimestampNanosTypeApiOps(sql/api) with NTZ/LTZ subclasses, registered inTypeApiOps.apply().getEncoderreturns the SPARK-57033 leaves (LocalDateTimeNanosEncoder(p)/InstantNanosEncoder(p)), gated byDataTypeErrors.checkTimestampNanosTypesEnabled().SerializerBuildHelper,DeserializerBuildHelper,CatalystTypeConverters,EncoderUtils,CodeGenerator,Literal, andInternalRow. InSerializerBuildHelper/DeserializerBuildHelper,OptionEncoder/TransformingEncoderare unwrapped before the framework leaf dispatch, since those wrapper encoders proxydataTypeto the wrapped encoder.MutableTimestampNanostoSpecificInternalRowto avoid theMutableAnyfallback.checkValueonspark.sql.timestampNanosTypes.enabledrequiringspark.sql.types.framework.enabled=true, so the types cannot be enabled outside the framework.Fractional-second string formatting is not implemented yet (no
TimestampFormatterfor these types). Until it lands, converting a nanos value to a string (CAST to STRING, EXPLAIN/SHOW output, SQL-literal rendering) raises the newUNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRINGerror rather than silently truncating to microseconds. Both the interpreted path (TimestampNanosTypeApiOps.format) and the codegen path (ToStringBase.castToStringCode) raise the identical error, so the two eval modes stay consistent.Out of scope (follow-ups): string formatting/CAST-to-string, Connect proto, Arrow, PySpark conversion, Parquet/ColumnVector, and physical ordering/compare/hash.
Why are the changes needed?
The logical nanosecond timestamp types (SPARK-56876) and the physical row layer (SPARK-56981) already exist, but these types were wired only through scattered legacy dispatch. Centralizing the type-specific operations behind
TypeOps, consistent withTimeType, is a prerequisite for the remaining nanosecond timestamp work and avoids the framework-on/off behavior divergence that the previous per-call-site handling produced.Does this PR introduce any user-facing change?
No. The nanosecond timestamp types are a preview feature gated by
spark.sql.timestampNanosTypes.enabled(andspark.sql.types.framework.enabled), both off by default in production. When these preview flags are enabled, converting a nanos timestamp to a string raisesUNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRINGbecause fractional-second formatting is not implemented yet.How was this patch tested?
Added/updated tests:
TimestampNanosTypeOpsSuite(catalyst):TypeOps/TypeApiOpsregistration;PhysicalDataType, defaultLiteral, and codegen Java class;InternalRow/SpecificInternalRowroundtrips incl. the dedicatedMutableTimestampNanosholder;getEncoderreturns the SPARK-57033 nanos encoders;CatalystTypeConvertersjava.timeroundtrip;format/toSQLValueraiseUNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING; framework-disabled leaves the types unsupported (no legacy fallback); enabling the nanos types requires the framework flag.TimestampNanosTypeOpsSuitealso covers Option-wrapped nanos encoder roundtrips (Some/None for NTZ and LTZ), verifying wrapper encoders are unwrapped before the framework serde dispatch.TimestampNanosRowSuite(catalyst): CAST nanos -> STRING raises the unsupported-feature error in both interpreted and codegen modes; unsafe/generic row roundtrips; literal validation.All tests pass.
catalyst/sql-apiscalastyle are clean.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor