Skip to content

[SPARK-57207][SQL] Support nanosecond timestamp types in the Types Framework#56266

Open
MaxGekk wants to merge 8 commits into
apache:masterfrom
MaxGekk:nanos-types-typeops
Open

[SPARK-57207][SQL] Support nanosecond timestamp types in the Types Framework#56266
MaxGekk wants to merge 8 commits into
apache:masterfrom
MaxGekk:nanos-types-typeops

Conversation

@MaxGekk
Copy link
Copy Markdown
Member

@MaxGekk MaxGekk commented Jun 2, 2026

What changes were proposed in this pull request?

This PR wires TimestampNTZNanosType(p) and TimestampLTZNanosType(p) (p in [7, 9]) through the Spark SQL Types Framework (SPARK-53504), so that all type-specific behavior for the nanosecond timestamp types is centralized behind TypeOps / TypeApiOps. The nanos types are now supported only through the framework: the scattered legacy dispatch for them is removed.

Concretely:

  • Add TimestampNanosTypeOps (catalyst) with TimestampNTZNanosTypeOps / TimestampLTZNanosTypeOps, registered in TypeOps.apply(). Overrides: getPhysicalType, getJavaClass, getBoxedJavaClass, getRowWriter, getDefaultLiteral, getJavaLiteral, getMutableValue, toCatalystImpl, toScala/toScalaImpl, createSerializer, createDeserializer.
  • Add a getBoxedJavaClass hook to the TypeOps base (the boxed Java class used in codegen). The createSerializer / createDeserializer hooks already exist on the base trait (used by TimeTypeOps); the nanos ops above only override them.
  • Add TimestampNanosTypeApiOps (sql/api) with NTZ/LTZ subclasses, registered in TypeApiOps.apply(). getEncoder returns the SPARK-57033 leaves (LocalDateTimeNanosEncoder(p) / InstantNanosEncoder(p)), gated by DataTypeErrors.checkTimestampNanosTypesEnabled().
  • Remove the nanos branches from the legacy code paths now handled by the framework: SerializerBuildHelper, DeserializerBuildHelper, CatalystTypeConverters, EncoderUtils, CodeGenerator, Literal, and InternalRow. In SerializerBuildHelper / DeserializerBuildHelper, OptionEncoder / TransformingEncoder are unwrapped before the framework leaf dispatch, since those wrapper encoders proxy dataType to the wrapped encoder.
  • Add MutableTimestampNanos to SpecificInternalRow to avoid the MutableAny fallback.
  • Add a checkValue on spark.sql.timestampNanosTypes.enabled requiring spark.sql.types.framework.enabled=true, so the types cannot be enabled outside the framework.

Fractional-second string formatting is not implemented yet (no TimestampFormatter for these types). Until it lands, converting a nanos value to a string (CAST to STRING, EXPLAIN/SHOW output, SQL-literal rendering) raises the new UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING error rather than silently truncating to microseconds. Both the interpreted path (TimestampNanosTypeApiOps.format) and the codegen path (ToStringBase.castToStringCode) raise the identical error, so the two eval modes stay consistent.

Out of scope (follow-ups): string formatting/CAST-to-string, Connect proto, Arrow, PySpark conversion, Parquet/ColumnVector, and physical ordering/compare/hash.

Why are the changes needed?

The logical nanosecond timestamp types (SPARK-56876) and the physical row layer (SPARK-56981) already exist, but these types were wired only through scattered legacy dispatch. Centralizing the type-specific operations behind TypeOps, consistent with TimeType, is a prerequisite for the remaining nanosecond timestamp work and avoids the framework-on/off behavior divergence that the previous per-call-site handling produced.

Does this PR introduce any user-facing change?

No. The nanosecond timestamp types are a preview feature gated by spark.sql.timestampNanosTypes.enabled (and spark.sql.types.framework.enabled), both off by default in production. When these preview flags are enabled, converting a nanos timestamp to a string raises UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING because fractional-second formatting is not implemented yet.

How was this patch tested?

Added/updated tests:

  • TimestampNanosTypeOpsSuite (catalyst): TypeOps/TypeApiOps registration; PhysicalDataType, default Literal, and codegen Java class; InternalRow/SpecificInternalRow roundtrips incl. the dedicated MutableTimestampNanos holder; getEncoder returns the SPARK-57033 nanos encoders; CatalystTypeConverters java.time roundtrip; format/toSQLValue raise UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING; framework-disabled leaves the types unsupported (no legacy fallback); enabling the nanos types requires the framework flag.
  • TimestampNanosTypeOpsSuite also covers Option-wrapped nanos encoder roundtrips (Some/None for NTZ and LTZ), verifying wrapper encoders are unwrapped before the framework serde dispatch.
  • TimestampNanosRowSuite (catalyst): CAST nanos -> STRING raises the unsupported-feature error in both interpreted and codegen modes; unsafe/generic row roundtrips; literal validation.
build/sbt 'catalyst/testOnly *TimestampNanosTypeOpsSuite *TimestampNanosRowSuite'
build/sbt 'core/testOnly org.apache.spark.SparkThrowableSuite'

All tests pass. catalyst / sql-api scalastyle are clean.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor

@MaxGekk
Copy link
Copy Markdown
Member Author

MaxGekk commented Jun 2, 2026

Please, review this PR @davidm-db @stevomitric @dejankrak-db

Copy link
Copy Markdown
Contributor

@stevomitric stevomitric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left a few comments, please resolve.

@MaxGekk
Copy link
Copy Markdown
Member Author

MaxGekk commented Jun 2, 2026

A general note for reviewers: the dual logic (framework + legacy) is only present for the Time data type, as it had been developed prior to the Types Framework. Every other new data type should go through the framework. Accordingly, this PR routes the nanosecond timestamp types (TimestampNTZNanosType / TimestampLTZNanosType) exclusively through TypeOps / TypeApiOps: the legacy duplicate paths have been removed, and the types are unsupported when spark.sql.types.framework.enabled is off (enforced by a checkValue on spark.sql.timestampNanosTypes.enabled).

@MaxGekk MaxGekk force-pushed the nanos-types-typeops branch from 87cd8a6 to d2f70e6 Compare June 2, 2026 15:18
@MaxGekk MaxGekk changed the title [SPARK-57207][SQL] Register nanosecond timestamp types in the Types Framework via TypeOps overrides [SPARK-57207][SQL] Support nanosecond timestamp types in the Types Framework Jun 2, 2026
MaxGekk added 7 commits June 3, 2026 10:39
…ramework via TypeOps overrides

Registers TimestampNTZNanosType and TimestampLTZNanosType in the Types
Framework by adding TimestampNanosTypeOps (server-side, catalyst) and
TimestampNanosTypeApiOps (client-side, sql/api) with their NTZ/LTZ
concrete subclasses. These only override existing TypeOps/TypeApiOps
methods (physical type, literals, row writer/accessor, mutable value,
codegen class, formatting); no new framework methods are introduced.

A dedicated MutableTimestampNanos holder is added to SpecificInternalRow
to avoid the MutableAny fallback. Encoders remain out of scope
(SPARK-57033), so getEncoder reports the type as unsupported to match the
legacy RowEncoder behavior.
…tamp formatting

Replace the interim TimestampNanosVal.toString passthrough in format() with
SparkException.internalError so callers see a clear message instead of a debug
string on the CAST-to-STRING path. Update the test to expect the error.

Co-authored-by: Isaac
…ypeOps to reuse SPARK-57033

The Types Framework overrides for the nanosecond timestamp types short-circuit
the legacy dispatch when spark.sql.types.framework.enabled is on. They were
written against the pre-SPARK-57033 world, so enabling the framework regressed
the encoder and java.time conversion support that PR apache#56158 already merged:

- getEncoder threw UNSUPPORTED_DATA_TYPE_FOR_ENCODER instead of returning the
  LocalDateTimeNanosEncoder / InstantNanosEncoder added by SPARK-57033.
- toCatalystImpl/toScala/toScalaImpl were identity passthroughs instead of
  converting java.time.LocalDateTime/Instant <-> TimestampNanosVal with
  per-precision truncation, as the legacy TimestampN(TZ|LTZ)NanosConverter does.

Wire getEncoder (gated by spark.sql.timestampNanosTypes.enabled, mirroring the
legacy RowEncoder path) to the precision-aware encoders, and route the catalyst
conversions through the shared DateTimeUtils helpers (raising INVALID_EXTERNAL_VALUE
on bad input). Add tests asserting the encoder/converter parity between the
framework-on and framework-off paths and that the feature-flag gate still holds.

Co-authored-by: Isaac
…ST-to-string behavior

The Types Framework TypeApiOps.format for nanosecond timestamp types threw an
internal "not yet implemented" error, which diverged from the framework-off
path: ToStringBase renders the internal value via UTF8String.fromString(toString)
(interpreted) / String.valueOf (codegen). That made CAST-to-string and display
output differ depending on spark.sql.types.framework.enabled.

Have format() mirror the legacy fallback by rendering the internal
TimestampNanosVal via toString, so enabling the Types Framework does not change
CAST-to-string / display output until dedicated fractional-second formatters
land. Update the test to assert format/toSQLValue follow this behavior.

Co-authored-by: Isaac
… Framework

Route all nanosecond timestamp (TimestampNTZNanosType / TimestampLTZNanosType)
handling through TypeOps / TypeApiOps and remove the legacy duplicate code paths,
so these types are supported only when the Types Framework is enabled and are
unsupported (no silent fallback) when it is disabled.

- Move createSerializer / createDeserializer into TimestampNanosTypeOps and add a
  getBoxedJavaClass hook on TypeOps (overridden by TimeTypeOps) so EncoderUtils,
  Serializer/DeserializerBuildHelper, RowEncoder, CatalystTypeConverters,
  PhysicalDataType, Literal, CodeGenerator and InternalRow no longer special-case
  nanos types.
- Make string formatting raise an internal error until a dedicated TimestampFormatter
  lands: TimestampNanosTypeApiOps.format / toSQLValue throw, and the codegen
  CAST-to-string path in ToStringBase throws the same error so interpreted and
  codegen behavior stay consistent.
- Update the nanos suites to assert framework-only behavior and that CAST-to-string
  fails the same way in both eval modes.
…s timestamp types

Validate via checkValue on spark.sql.timestampNanosTypes.enabled that it can only be
set to true when spark.sql.types.framework.enabled is true, since the nanosecond
timestamp types are implemented solely through the Types Framework. Switch the entry
to createWithDefaultFunction so the validator runs only on explicitly-set values, not
on the default during SQLConf initialization.
…mestamps as strings

Converting a nanosecond timestamp value to a string (CAST to STRING,
EXPLAIN/SHOW output, SQL-literal rendering) is reachable from public APIs
once the preview flags are enabled, so the previous SparkException.internalError
was inappropriate. Introduce the UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING
error condition and raise it from both the interpreted (TimestampNanosTypeApiOps.format)
and codegen (ToStringBase.castToStringCode) paths via a shared DataTypeErrors helper.
@MaxGekk MaxGekk force-pushed the nanos-types-typeops branch from d2f70e6 to 9ac8e3a Compare June 3, 2026 08:43
Copy link
Copy Markdown
Contributor

@davidm-db davidm-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, on the description: the "What changes were proposed" section says this PR adds three hooks to the TypeOps base - getBoxedJavaClass, createSerializer, and createDeserializer. Only getBoxedJavaClass is actually new here; createSerializer and createDeserializer already exist on the base trait (added in #54905), and this PR overrides them on the nanos ops. Might be worth rewording so reviewers don't go looking for base-trait additions that aren't in the diff.

@scala.annotation.tailrec
def javaBoxedType(dt: DataType): Class[_] = dt match {
def javaBoxedType(dt: DataType): Class[_] =
TypeOps(dt).map(_.getBoxedJavaClass).getOrElse(javaBoxedTypeDefault(dt))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question (non-blocking): is getBoxedJavaClass reachable for the framework types today? javaBoxedType has two callers - ValidateExternalType.checkType and MapObjects.elementClassTag - and for non-native encoders both receive the external ObjectType(...) (e.g. ObjectType(LocalTime) / ObjectType(LocalDateTime)), not the raw TimestampNTZNanosType / TimeType; MapObjects's boxed-element branch only fires for native catalyst element types. So TypeOps(dt).map(_.getBoxedJavaClass) appears to return Some only for types that never reach javaBoxedType with their literal form. If that's right, the hook plus this wiring is effectively defensive for now - which is fine and consistent with centralizing the dispatch (the explicit nanos arms it replaces were equally unreached) - I just want to confirm it's intentional rather than closing a live gap. The hook is still the right shape if it ever does become reachable, since getJavaClass returns the primitive class for Long-backed types.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude flagged this, I wasn't able to disprove it - so this is more of a question for my knowledge...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your analysis is correct, and it's intentional/defensive rather than closing a live gap. Both callers of javaBoxedType (ValidateExternalType.checkType, MapObjects.elementClassTag) receive the external ObjectType(...) for non-native encoders, and MapObjects's boxed-element branch only fires for native catalyst element types - so TypeOps(dt).map(_.getBoxedJavaClass) returns Some only for dataTypes that don't currently reach javaBoxedType in their raw form. It directly replaces the explicit case _: TimestampNTZNanosType => classOf[TimestampNanosVal] arms that the old javaBoxedType had, which were equally unreached. I kept it because it centralizes the dispatch consistently with the rest of the framework and is the correct shape if a primitive-backed framework type ever does reach javaBoxedType (getBoxedJavaClass returns the boxed class, getJavaClass the primitive).

Copy link
Copy Markdown
Contributor

@stevomitric stevomitric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapper encoders can be bypassed by the new TypeOps serializer/deserializer dispatch. In PR head, SerializerBuildHelper.createSerializer and DeserializerBuildHelper.createDeserializer call TypeOps(enc.dataType) before matching wrapper encoders. But OptionEncoder and TransformingEncoder proxy dataType to the wrapped encoder, so OptionEncoder(LocalDateTimeNanosEncoder(p)).dataType == TimestampNTZNanosType(p). That means the nanos TypeOps path can try to serialize an Option[LocalDateTime] as if it were a bare LocalDateTime, and deserialize to LocalDateTime without WrapOption. Fix by unwrapping OptionEncoder / TransformingEncoder before framework dispatch, and add tests for optional/nested nanos encoder roundtrips.

…rk serde dispatch

SerializerBuildHelper.createSerializer and DeserializerBuildHelper.createDeserializer
dispatch to TypeOps(enc.dataType) before matching wrapper encoders. OptionEncoder and
TransformingEncoder proxy dataType to the wrapped encoder, so e.g.
OptionEncoder(LocalDateTimeNanosEncoder(p)).dataType == TimestampNTZNanosType(p), which
let the framework leaf serializer/deserializer fire on the wrapper and skip
UnwrapOption/WrapOption (and the transforming codec). Route OptionEncoder and
TransformingEncoder through the default path first so the framework leaf dispatch only
sees unwrapped leaf encoders. Add a roundtrip test for Option-wrapped nanos encoders.
@MaxGekk
Copy link
Copy Markdown
Member Author

MaxGekk commented Jun 3, 2026

Thanks for the review @davidm-db @stevomitric. Addressing the two review-summary comments:

@davidm-db (description wording). You're right - only getBoxedJavaClass is newly added to the TypeOps base in this PR; createSerializer / createDeserializer already exist on the base trait and are used by TimeTypeOps, and here they're just overridden on the nanos ops. I've reworded the "What changes were proposed" section so it no longer implies base-trait additions that aren't in the diff.

@stevomitric (wrapper encoders bypassing the framework serde dispatch). Good catch. Since OptionEncoder / TransformingEncoder proxy dataType to the wrapped encoder, OptionEncoder(LocalDateTimeNanosEncoder(p)).dataType == TimestampNTZNanosType(p), so the framework createSerializer / createDeserializer dispatch (which keys off enc.dataType) fired on the wrapper and skipped UnwrapOption / WrapOption (and the codec for TransformingEncoder).

Fixed in 02db736: SerializerBuildHelper.createSerializer / DeserializerBuildHelper.createDeserializer now route OptionEncoder / TransformingEncoder through the default path first, so the framework leaf dispatch only ever sees unwrapped leaf encoders. Added a roundtrip test for Option-wrapped nanos encoders (Some / None, NTZ and LTZ).

For context: this dispatch ordering predates this PR (the helpers have routed through TypeOps(enc.dataType) since the serde hooks landed, and TimeTypeOps already overrides them), so the same fix also covers Option[LocalTime] etc. The nanos leaf encoders aren't reachable via Scala reflection today (Option[LocalDateTime] resolves to the micros LocalDateTimeEncoder), so the gap was latent - but the ordering was fragile, so better to make it robust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants