From 21c6e6ec05d66a2dd0543dda931dde47afbb72fb Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Mon, 23 Feb 2026 12:38:23 -0500 Subject: [PATCH 1/4] first draft extension_types Signed-off-by: Connor Tsui --- proposals/0005-extension-types.md | 97 +++++++++++++++++++++++++++++++ 1 file changed, 97 insertions(+) create mode 100644 proposals/0005-extension-types.md diff --git a/proposals/0005-extension-types.md b/proposals/0005-extension-types.md new file mode 100644 index 0000000..0d7c4b1 --- /dev/null +++ b/proposals/0005-extension-types.md @@ -0,0 +1,97 @@ +- Start Date: (2026-02-27) +- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000) +- Tracking Issue: [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547) + +## Summary + +We would like to build a more robust system for extension data types (or DTypes). + +TODO + +## Motivation + +TODO + +## Design + +[vortex-data/vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g. `Timestamp`) now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`. + +There were a few blockers (detailed in the previous tracking issue [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), but now that those have been resolved we can move forward with this. + +Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can now place all extension logic (for types, scalars, and arrays) onto an `ExtVTable`. It will look something like so: + +```rust +// Naming should be considered VERY unstable / not set! + +pub trait ExtVTable: 'static + Send + Sync + ... { + // Extra data that complements the extension type. + type Metadata: ...; + + // A native Rust value that represents a scalar of the extension type. 
+ type Value<'a>: Display; + + // `DType` + + fn id(&self) -> ExtID; + fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>; + fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>; + fn deserialize_metadata(&self, data: &[u8]) -> VortexResult<Self::Metadata>; + + // `Scalar` + + fn validate_scalar_value(&self, metadata: &Self::Metadata, storage_dtype: &DType, storage_value: &ScalarValue) -> VortexResult<()>; + fn unpack<'a>(&self, metadata: &'a Self::Metadata, storage_dtype: &'a DType, storage_value: &'a ScalarValue) -> Self::Value<'a>; + fn cast_scalar(&self, metadata: &Self::Metadata, scalar: &Scalar, target: &DType) -> VortexResult { ... } + + // `ArrayRef` + + fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>; + fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult { ... } + fn other_compute_thing???(&self, ...) -> VortexResult { ... } + // <-- Probably a lot more than this --> +} +``` + +TODO + +## Compatibility + +TODO + +## Drawbacks + +TODO + +## Alternatives + +TODO + +## Prior Art + +TODO + +## Unresolved Questions + +TODO + +## Future Possibilities + +If we can get extension types working well, then theoretically we can easily add all of these types: + +- `DateTimeParts` (`Primitive`) +- Matrix (`FixedSizeList`) +- Tensor (`FixedSizeList`) +- UUID (Do we need to add `FixedSizeBinary` as a canonical type?) +- JSON (`UTF8`) +- PDX: https://arxiv.org/pdf/2503.04422v1 (`FixedSizeList`) +- Variant + - Shredding (Lots of possibilities here!) +- Union + - Sparse (`Struct { Primitive, Struct { types } }`) + - Dense[^1] +- Map (`List`) +- Tags: https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892 (`ListView`) +- `Struct` but with protobuf-style field numbers (`Struct`) +- Probably lots more! 
+ +[^1]: `Struct` doesn't work here because children can have different lengths, but what we could do is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would effectively be the exact same but with the overhead of tracking indices for each of the child fields. In that case, it might just be better to always use a "sparse" union and let the compressor decide what to do. From c0bf284b504bea9fe1a78c231d7a16fef777670e Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 27 Feb 2026 10:00:02 -0500 Subject: [PATCH 2/4] reformat Signed-off-by: Connor Tsui --- proposals/0005-extension-types.md | 106 ++++++++++++++++++++++++------ 1 file changed, 86 insertions(+), 20 deletions(-) diff --git a/proposals/0005-extension-types.md b/proposals/0005-extension-types.md index 0d7c4b1..134e11f 100644 --- a/proposals/0005-extension-types.md +++ b/proposals/0005-extension-types.md @@ -1,5 +1,5 @@ - Start Date: (2026-02-27) -- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000) +- RFC PR: [vortex-data/rfcs#5](https://github.com/vortex-data/rfcs/pull/5) - Tracking Issue: [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547) ## Summary @@ -10,38 +10,103 @@ TODO ## Motivation -TODO +The unfortunate reality of the type system in Vortex is that we cannot easily add +new or update logical types. +For example, the effort to add `FixedSizeList` ([vortex-data/vortex#4372](https://github.com/vortex-data/vortex/issues/4372)) +and also change `List` to `ListView` ([vortex-data/vortex#4699](https://github.com/vortex-data/vortex/issues/4699)) +was very intrusive. + +Vortex provides an `Extension` canonical variant to help with this. +Currently, implementors can add a new extension type by adding a new extension ID +(for example, `vortex.time` or `vortex.date`) and specifying a canonical storage type that behaves +like the "physical" type of the extension type. 
+For example, the time extension types have a primitive storage type, meaning they wrap the primitive +scalars or primitive arrays with some extra logic on top of it +(mostly validating timestamps are valid). + +Right now, ## Design -[vortex-data/vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g. `Timestamp`) now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`. +[vortex-data/vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables +(virtual tables, or Rust unit structs with methods) for extension `DType`s. +Each extension type (e.g. `Timestamp`) now implements `ExtDTypeVTable`, which handles validation, +serialization, and metadata. +The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`. -There were a few blockers (detailed in the previous tracking issue [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), but now that those have been resolved we can move forward with this. +There were a few blockers (detailed in the previous tracking issue +[vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), +but now that those have been resolved we can move forward with this. -Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can now place all extension logic (for types, scalars, and arrays) onto an `ExtVTable`. It will look something like so: +Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, +we can now place all extension logic (for types, scalars, and arrays) onto an `ExtVTable`. +It will look something like so: ```rust // Naming should be considered VERY unstable / not set! -pub trait ExtVTable: 'static + Send + Sync + ... 
{ - // Extra data that complements the extension type. - type Metadata: ...; +/// The public API for defining new extension types. +/// +/// This is the non-object-safe trait that plugin authors implement to define a new extension +/// type. It specifies the type's identity, metadata, serialization, and validation. +pub trait ExtVTable: 'static + Sized + Send + Sync + Clone + Debug + Eq + Hash { + /// Associated type containing the deserialized metadata for this extension type. + type Metadata: 'static + Send + Sync + Clone + Debug + Display + Eq + Hash; - // A native Rust value that represents a scalar of the extension type. - type Value<'a>: Display; + /// A native Rust value that represents a scalar of the extension type. + /// + /// The value only represents non-null values. We denote nullable values as `Option`. + type NativeValue<'a>: Display; - // `DType` + /// Returns the ID for this extension type. + fn id(&self) -> ExtId; - fn id(&self) -> ExtID; - fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>; + // Methods related to the extension `DType`. + + /// Serialize the metadata into a byte vector. fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>; - fn deserialize_metadata(&self, data: &[u8]) -> VortexResult<Self::Metadata>; - // `Scalar` + /// Deserialize the metadata from a byte slice. + fn deserialize_metadata(&self, metadata: &[u8]) -> VortexResult<Self::Metadata>; + + /// Validate that the given storage type is compatible with this extension type. + fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>; - fn validate_scalar_value(&self, metadata: &Self::Metadata, storage_dtype: &DType, storage_value: &ScalarValue) -> VortexResult<()>; - fn unpack<'a>(&self, metadata: &'a Self::Metadata, storage_dtype: &'a DType, storage_value: &'a ScalarValue) -> Self::Value<'a>; - fn cast_scalar(&self, metadata: &Self::Metadata, scalar: &Scalar, target: &DType) -> VortexResult { ... 
} + // Methods related to the extension scalar values. + + /// Validate the given storage value is compatible with the extension type. + /// + /// By default, this calls [`unpack_native()`](ExtVTable::unpack_native) and discards the + /// result. + /// + /// # Errors + /// + /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type. + fn validate_scalar_value( + &self, + metadata: &Self::Metadata, + storage_dtype: &DType, + storage_value: &ScalarValue, + ) -> VortexResult<()> { + self.unpack_native(metadata, storage_dtype, storage_value) + .map(|_| ()) + } + + /// Validate and unpack a native value from the storage [`ScalarValue`]. + /// + /// Note that [`ExtVTable::validate_dtype()`] is always called first to validate the storage + /// [`DType`], and the [`Scalar`](crate::scalar::Scalar) implementation will verify that the + /// storage value is compatible with the storage dtype on construction. + /// + /// # Errors + /// + /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type. + fn unpack_native<'a>( + &self, + metadata: &'a Self::Metadata, + storage_dtype: &'a DType, + storage_value: &'a ScalarValue, + ) -> VortexResult<Self::NativeValue<'a>>; // `ArrayRef` @@ -84,14 +149,15 @@ If we can get extension types working well, then theoretically we can easily add - UUID (Do we need to add `FixedSizeBinary` as a canonical type?) - JSON (`UTF8`) - PDX: https://arxiv.org/pdf/2503.04422v1 (`FixedSizeList`) -- Variant - - Shredding (Lots of possibilities here!) - Union - Sparse (`Struct { Primitive, Struct { types } }`) - Dense[^1] - Map (`List`) - Tags: https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892 (`ListView`) - `Struct` but with protobuf-style field numbers (`Struct`) +- **NOT** Variant[^2] - Probably lots more! 
[^1]: `Struct` doesn't work here because children can have different lengths, but what we could do is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would effectively be the exact same but with the overhead of tracking indices for each of the child fields. In that case, it might just be better to always use a "sparse" union and let the compressor decide what to do. + +[^2]: TODO From 4759c5cbcdb2ca545bb23941778bd02db1b56a39 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 27 Feb 2026 12:06:15 -0500 Subject: [PATCH 3/4] clean up Signed-off-by: Connor Tsui --- proposals/0005-extension-types.md | 159 +++++++++++++++++++++--------- 1 file changed, 115 insertions(+), 44 deletions(-) diff --git a/proposals/0005-extension-types.md b/proposals/0005-extension-types.md index 134e11f..368fb84 100644 --- a/proposals/0005-extension-types.md +++ b/proposals/0005-extension-types.md @@ -4,46 +4,76 @@ ## Summary -We would like to build a more robust system for extension data types (or DTypes). - -TODO +We would like to build a more robust system for extension data types (or `DType`s). This RFC +proposes a direction for extending the `ExtVTable` trait to support richer behavior (beyond +forwarding to the storage type), lays out the completed and in-progress work, and identifies the +open questions that remain. ## Motivation -The unfortunate reality of the type system in Vortex is that we cannot easily add -new or update logical types. -For example, the effort to add `FixedSizeList` ([vortex-data/vortex#4372](https://github.com/vortex-data/vortex/issues/4372)) -and also change `List` to `ListView` ([vortex-data/vortex#4699](https://github.com/vortex-data/vortex/issues/4699)) -was very intrusive. - -Vortex provides an `Extension` canonical variant to help with this. 
-Currently, implementors can add a new extension type by adding a new extension ID -(for example, `vortex.time` or `vortex.date`) and specifying a canonical storage type that behaves -like the "physical" type of the extension type. -For example, the time extension types have a primitive storage type, meaning they wrap the primitive -scalars or primitive arrays with some extra logic on top of it -(mostly validating timestamps are valid). - -Right now, +A limitation of the current type system in Vortex is that we cannot easily add new logical types. +For example, the effort to add `FixedSizeList` +([vortex#4372](https://github.com/vortex-data/vortex/issues/4372)) and also change `List` to +`ListView` ([vortex#4699](https://github.com/vortex-data/vortex/issues/4699)) was very intrusive. +It is much easier to add wrappers around canonical types (treating the canonical dtype as a +"storage type") and implement some additional logic than to add a new variant to the `DType` enum. + +Vortex provides an `Extension` variant of `DType` to help with this. Currently, implementors can add +a new extension type by defining an extension ID (for example, `vortex.time` or `vortex.date`) and +specifying a canonical storage type that behaves like the "physical" type of the extension type. +For example, the time extension types use a primitive storage type, meaning they wrap the primitive +scalars or primitive arrays with some extra logic on top of it (mostly validating that the +timestamps are valid). + +We would like to add many more extension types. Some notable extension types (and their likely +storage types) include: + +- **Matrix / Tensor**: This would be an extension over `FixedSizeList`, where dimensions correspond + to levels of nesting. There are many open questions on the design of this, but that is out of + scope of this RFC. +- **Union**: The sum type of an algebraic data type, like a Rust enum. 
We would likely implement + this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`). + Vortex may be well suited to represent this because it can compress each of the type field arrays + independently, so we do not need to distinguish between a "Sparse" or "Dense" Union. +- **UUID**: Since this is a 128-bit number, we likely want to add `FixedSizeBinary`. This is out of + scope for this RFC. + +The issue with the current system is that it only forwards logic to the underlying storage type. +The only other behavior we support is serializing and pretty-printing extension arrays. This means +that we cannot define custom compute logic for extension types. + +Take the time extension types as an example of where this limitation does not matter. If we want to +run a `compare` expression over a timestamp array, we just run the `compare` over the underlying +primitive array. For simple types like timestamps, this is sufficient (and this is what we do right +now). For types like Tensors (which are simply type aliases over `FixedSizeList`), this is also +probably fine. + +However, for more complex types like UUID, Union, or JSON, this is likely not sufficient. It +remains an open question whether forwarding to the storage type is always enough, but assuming we do +need richer behavior, we want a more robust implementation path (instead of wrapping +`ExtensionArray` and performing significant internal dispatch work). ## Design -[vortex-data/vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables -(virtual tables, or Rust unit structs with methods) for extension `DType`s. -Each extension type (e.g. `Timestamp`) now implements `ExtDTypeVTable`, which handles validation, -serialization, and metadata. +### Completed Work + +[vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, +or Rust unit structs with methods) for extension `DType`s. 
Each extension type (e.g., `Timestamp`) +now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`. -There were a few blockers (detailed in the previous tracking issue -[vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), -but now that those have been resolved we can move forward with this. +There were a few blockers (detailed in the tracking issue +[vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), +but now that those have been resolved, we can move forward. -Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, -we can now place all extension logic (for types, scalars, and arrays) onto an `ExtVTable`. -It will look something like so: +### Current Work + +Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place +all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (it has been renamed now). +It will look something like the following: ```rust -// Naming should be considered VERY unstable / not set! +// Note: naming should be considered unstable. /// The public API for defining new extension types. /// @@ -112,36 +142,68 @@ pub trait ExtVTable: 'static + Sized + Send + Sync + Clone + Debug + Eq + Hash { fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>; fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult { ... } - fn other_compute_thing???(&self, ...) -> VortexResult { ... } - // <-- Probably a lot more than this --> + // Additional compute methods TBD. +} +``` + +Most of the implementation work will be making sure that `ExtDTypeRef` (which we pass around as the +`Extension` variant of `DType`) has the correct methods that access the internal, type-erased +`ExtVTable`. + +Take extension scalars as an example. 
The only behavior we need from extension scalars is validating +that they have correct values, displaying them, and unpacking them into native types. So we added +these methods to `ExtDTypeRef`: + +```rust +impl ExtDTypeRef { + /// Formats an extension scalar value using the current dtype for metadata context. + pub fn fmt_storage_value<'a>( + &'a self, + f: &mut fmt::Formatter<'_>, + storage_value: &'a ScalarValue, + ) -> fmt::Result { ... } + + /// Validates that the given storage scalar value is valid for this dtype. + pub fn validate_storage_value(&self, storage_value: &ScalarValue) -> VortexResult<()> { ... } } ``` -TODO +We do not yet know what the API for extension arrays should look like, so it is hard to say what +other methods should exist on `ExtDTypeRef`. ## Compatibility -TODO +This should not break anything because extension types are mostly related to in-memory APIs (since +data is read from and written to disk as the storage type). + +There could potentially be performance optimizations we could make with the new extension type +system, but it is hard to know for sure. ## Drawbacks -TODO +As stated before, this might be overkill. ## Alternatives -TODO +We could have many `ExtensionArray` wrappers with custom logic. This +might not even work, but even if it did, it would likely be very clunky. ## Prior Art -TODO +Apache Arrow allows defining +[extension types](https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types) +and also provides a +[set of canonical extension types](https://arrow.apache.org/docs/format/CanonicalExtensions.html). ## Unresolved Questions -TODO +It is not yet clear whether changes to our extension types are strictly necessary, or whether +forwarding to the storage type will always suffice. We also do not yet know how to define compute +expressions over extension arrays. 
## Future Possibilities -If we can get extension types working well, then theoretically we can easily add all of these types: +If we can get extension types working well, we can add all of the following types: - `DateTimeParts` (`Primitive`) - Matrix (`FixedSizeList`) @@ -153,11 +215,20 @@ If we can get extension types working well, then theoretically we can easily add - Sparse (`Struct { Primitive, Struct { types } }`) - Dense[^1] - Map (`List`) -- Tags: https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892 (`ListView`) +- Tags: See this + [discussion](https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892), + where we think we can represent this with (`ListView`) - `Struct` but with protobuf-style field numbers (`Struct`) - **NOT** Variant[^2] -- Probably lots more! - -[^1]: `Struct` doesn't work here because children can have different lengths, but what we could do is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would effectively be the exact same but with the overhead of tracking indices for each of the child fields. In that case, it might just be better to always use a "sparse" union and let the compressor decide what to do. - -[^2]: TODO +- And likely more. + +[^1]: + `Struct` doesn't work here because children can have different lengths, but what we could do + is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would + effectively be the exact same but with the overhead of tracking indices for each of the child + fields. In that case, it might just be better to always use a "sparse" union and let the + compressor decide what to do. + +[^2]: + We likely cannot implement `Variant` as an extension type because we have no way of defining + what the storage type would be (since we have no idea what the schema is for each row). 
From b0fc69f9eeb0e3e43c6a0f7b1e8d8b69b3e36855 Mon Sep 17 00:00:00 2001 From: Connor Tsui Date: Fri, 27 Feb 2026 13:36:53 -0500 Subject: [PATCH 4/4] more clarity on open questions Signed-off-by: Connor Tsui --- proposals/0005-extension-types.md | 50 ++++++++++++++++--------------- 1 file changed, 26 insertions(+), 24 deletions(-) diff --git a/proposals/0005-extension-types.md b/proposals/0005-extension-types.md index 368fb84..18ea5a5 100644 --- a/proposals/0005-extension-types.md +++ b/proposals/0005-extension-types.md @@ -22,8 +22,8 @@ Vortex provides an `Extension` variant of `DType` to help with this. Currently, a new extension type by defining an extension ID (for example, `vortex.time` or `vortex.date`) and specifying a canonical storage type that behaves like the "physical" type of the extension type. For example, the time extension types use a primitive storage type, meaning they wrap the primitive -scalars or primitive arrays with some extra logic on top of it (mostly validating that the -timestamps are valid). +scalars or primitive arrays with some extra logic on top (mostly validating that the timestamps are +valid). We would like to add many more extension types. Some notable extension types (and their likely storage types) include: @@ -31,9 +31,9 @@ storage types) include: - **Matrix / Tensor**: This would be an extension over `FixedSizeList`, where dimensions correspond to levels of nesting. There are many open questions on the design of this, but that is out of scope of this RFC. -- **Union**: The sum type of an algebraic data type, like a Rust enum. We would likely implement +- **Union**: The sum type of an algebraic data type, like a Rust enum. One approach is to implement this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`). 
- Vortex may be well suited to represent this because it can compress each of the type field arrays + Vortex is well suited to represent this because it can compress each of the type field arrays independently, so we do not need to distinguish between a "Sparse" or "Dense" Union. - **UUID**: Since this is a 128-bit number, we likely want to add `FixedSizeBinary`. This is out of scope for this RFC. @@ -46,16 +46,16 @@ Take the time extension types as an example of where this limitation does not ma run a `compare` expression over a timestamp array, we just run the `compare` over the underlying primitive array. For simple types like timestamps, this is sufficient (and this is what we do right now). For types like Tensors (which are simply type aliases over `FixedSizeList`), this is also -probably fine. +fine. -However, for more complex types like UUID, Union, or JSON, this is likely not sufficient. It -remains an open question whether forwarding to the storage type is always enough, but assuming we do -need richer behavior, we want a more robust implementation path (instead of wrapping -`ExtensionArray` and performing significant internal dispatch work). +However, for more complex types like UUID, Union, or JSON, forwarding to the storage type is likely +insufficient as these types need custom compute logic. Given that, we want a more robust +implementation path instead of wrapping `ExtensionArray` and performing significant internal +dispatch work. ## Design -### Completed Work +### Background [vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g., `Timestamp`) @@ -66,10 +66,12 @@ There were a few blockers (detailed in the tracking issue [vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), but now that those have been resolved, we can move forward. 
-### Current Work +### Proposed Design Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place -all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (it has been renamed now). +all extension logic (for types, scalars, and arrays) onto an `ExtVTable` (renamed from +`ExtDTypeVTable`). + It will look something like the following: ```rust @@ -168,25 +170,23 @@ impl ExtDTypeRef { } ``` -We do not yet know what the API for extension arrays should look like, so it is hard to say what -other methods should exist on `ExtDTypeRef`. +**Open question**: What should the API for extension arrays look like? The answer will determine +what additional methods `ExtDTypeRef` needs beyond the scalar-related ones shown above. ## Compatibility This should not break anything because extension types are mostly related to in-memory APIs (since data is read from and written to disk as the storage type). -There could potentially be performance optimizations we could make with the new extension type -system, but it is hard to know for sure. - ## Drawbacks -As stated before, this might be overkill. +If forwarding to the storage type turns out to be sufficient for all extension types, the +additional vtable surface area adds complexity without clear benefit. ## Alternatives -We could have many `ExtensionArray` wrappers with custom logic. This -might not even work, but even if it did, it would likely be very clunky. +We could have many `ExtensionArray` wrappers with custom logic. This approach would be clunky and +may not scale. ## Prior Art @@ -197,9 +197,11 @@ and also provides a ## Unresolved Questions -It is not yet clear whether changes to our extension types are strictly necessary, or whether -forwarding to the storage type will always suffice. We also do not yet know how to define compute -expressions over extension arrays. 
+- Is forwarding to the storage type insufficient, and which extension types genuinely need custom + compute logic? +- What should the `ExtVTable` API for extension arrays look like? What methods beyond + `validate_array` are needed? +- How should compute expressions be defined and dispatched for extension types? ## Future Possibilities @@ -231,4 +233,4 @@ If we can get extension types working well, we can add all of the following type [^2]: We likely cannot implement `Variant` as an extension type because we have no way of defining - what the storage type would be (since we have no idea what the schema is for each row). + what the storage type would be (since the schema is not known ahead of time for each row).
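The scalar-related half of the proposed `ExtVTable` can be sketched in a self-contained way to show how the default `validate_scalar_value` falls out of `unpack_native`. Everything below is an assumption made for illustration, not the Vortex API: `DType`, `ScalarValue`, and `SketchResult` are simplified stand-ins for the real types, and `ExtVTableSketch`, `TimeOfDay`, `TimeUnit`, and `TimeValue` are hypothetical names, as is the range rule used for validation.

```rust
use std::fmt::{self, Display};

// Hypothetical stand-ins for the real Vortex `DType`, `ScalarValue`, and `VortexResult`.
#[derive(Debug, Clone, PartialEq)]
enum DType { I64, Utf8 }

#[derive(Debug, Clone, PartialEq)]
enum ScalarValue { I64(i64), Utf8(String) }

type SketchResult<T> = Result<T, String>;

// Trimmed-down mirror of the scalar-related methods of the proposed trait.
trait ExtVTableSketch {
    type Metadata;
    type NativeValue<'a>: Display;

    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> SketchResult<()>;

    fn unpack_native<'a>(
        &self,
        metadata: &'a Self::Metadata,
        storage_dtype: &'a DType,
        storage_value: &'a ScalarValue,
    ) -> SketchResult<Self::NativeValue<'a>>;

    // Default: a storage value is valid exactly when it unpacks.
    fn validate_scalar_value(
        &self,
        metadata: &Self::Metadata,
        storage_dtype: &DType,
        storage_value: &ScalarValue,
    ) -> SketchResult<()> {
        self.unpack_native(metadata, storage_dtype, storage_value).map(|_| ())
    }
}

// Hypothetical metadata: the resolution of a time-of-day value.
#[derive(Debug, Clone, PartialEq)]
enum TimeUnit { Seconds, Millis }

impl TimeUnit {
    fn per_day(&self) -> i64 {
        match self {
            TimeUnit::Seconds => 86_400,
            TimeUnit::Millis => 86_400_000,
        }
    }
}

// Unit struct acting as the vtable for a hypothetical time-of-day extension type.
struct TimeOfDay;

// Native value borrowing its unit from the metadata.
struct TimeValue<'a> { ticks: i64, unit: &'a TimeUnit }

impl Display for TimeValue<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} {:?}", self.ticks, self.unit)
    }
}

impl ExtVTableSketch for TimeOfDay {
    type Metadata = TimeUnit;
    type NativeValue<'a> = TimeValue<'a>;

    fn validate_dtype(&self, _metadata: &TimeUnit, storage_dtype: &DType) -> SketchResult<()> {
        match storage_dtype {
            DType::I64 => Ok(()),
            other => Err(format!("time of day requires I64 storage, got {other:?}")),
        }
    }

    fn unpack_native<'a>(
        &self,
        metadata: &'a TimeUnit,
        storage_dtype: &'a DType,
        storage_value: &'a ScalarValue,
    ) -> SketchResult<TimeValue<'a>> {
        self.validate_dtype(metadata, storage_dtype)?;
        match storage_value {
            // Assumed rule: ticks must fall within a single day for the given unit.
            ScalarValue::I64(ticks) if (0..metadata.per_day()).contains(ticks) => {
                Ok(TimeValue { ticks: *ticks, unit: metadata })
            }
            other => Err(format!("invalid time of day: {other:?}")),
        }
    }
}
```

In this sketch an implementor only writes `validate_dtype` and `unpack_native`; scalar validation and display then follow from the native value. Array-level methods are left out because their shape is still an open question in the RFC.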