feat: Add Dictionary schema read support in IPC reader#738
feat: Add Dictionary schema read support in IPC reader#738paleolimbot merged 22 commits intoapache:mainfrom
Conversation
|
This looks great! |
9724dd2 to
3083eb9
Compare
a3a0d31 to
4885b74
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #738 +/- ##
==========================================
- Coverage 79.96% 79.67% -0.30%
==========================================
Files 105 106 +1
Lines 15461 15726 +265
Branches 1738 1769 +31
==========================================
+ Hits 12364 12529 +165
- Misses 1998 2076 +78
- Partials 1099 1121 +22 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| #if defined(NANOARROW_BUILD_TESTS_WITH_ARROW) | ||
| class ArrowTypeParameterizedTestFixture | ||
| : public ::testing::TestWithParam<std::shared_ptr<arrow::DataType>> { | ||
| protected: | ||
| std::shared_ptr<arrow::DataType> data_type; | ||
| }; | ||
|
|
||
| TEST_P(ArrowTypeParameterizedTestFixture, NanoarrowIpcArrowTypeRoundtrip) { |
There was a problem hiding this comment.
These are just moved because I found the order of the tests was hard to parse with respect to where the parameter s were defined.
| TEST(NanoarrowIpcTest, NanoarrowIpcFooterDecodingErrors) { | ||
| struct ArrowError error; | ||
|
|
||
| nanoarrow::ipc::UniqueDecoder decoder; | ||
| ArrowIpcDecoderInit(decoder.get()); | ||
|
|
||
| // not enough data to get the size+magic |
There was a problem hiding this comment.
These are also just moved around
| struct ArrowBufferView data; | ||
| data.data.as_uint8 = kDictionarySchema; | ||
| data.size_bytes = sizeof(kDictionarySchema); | ||
|
|
||
| ASSERT_EQ(ArrowIpcDecoderInit(&decoder), NANOARROW_OK); | ||
|
|
||
| EXPECT_EQ(ArrowIpcDecoderDecodeHeader(&decoder, data, &error), NANOARROW_OK); | ||
| ASSERT_EQ(decoder.message_type, NANOARROW_IPC_MESSAGE_TYPE_SCHEMA); | ||
|
|
||
| ASSERT_EQ(ArrowIpcDecoderDecodeSchema(&decoder, &schema, &error), NANOARROW_OK); | ||
| ASSERT_EQ(schema.n_children, 1); | ||
| EXPECT_STREQ(schema.children[0]->name, "some_col"); | ||
| EXPECT_EQ(schema.children[0]->flags, ARROW_FLAG_NULLABLE); | ||
| EXPECT_STREQ(schema.children[0]->format, "c"); | ||
|
|
||
| ASSERT_NE(schema.children[0]->dictionary, nullptr); | ||
| EXPECT_STREQ(schema.children[0]->dictionary->format, "u"); |
There was a problem hiding this comment.
This is the crux of what this PR does: it can decode a schema message that contains a dictionary instead of erroring.
There was a problem hiding this comment.
Pull request overview
Adds initial Arrow IPC dictionary support by decoding dictionary-encoded fields in Schema messages and exposing DictionaryBatch headers in the decoder API, while keeping dictionary array decoding itself unsupported for now.
Changes:
- Introduces a public
ArrowIpcDictionaryBatchstruct and exposes the last decoded DictionaryBatch onArrowIpcDecoder. - Adds dictionary encoding handling when decoding Schema fields (including extension-metadata placement).
- Updates reader behavior/tests to recognize DictionaryBatch messages and return
ENOTSUPwhere full decoding isn’t implemented yet.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/nanoarrow/nanoarrow_ipc.h | Exposes DictionaryBatch metadata via new struct and decoder pointer. |
| src/nanoarrow/ipc/reader.c | Detects DictionaryBatch during stream read and returns an explicit not-supported error. |
| src/nanoarrow/ipc/files_test.cc | Updates not-supported expectations for dictionary-related streams. |
| src/nanoarrow/ipc/decoder_test.cc | Adds tests for dictionary schema/batch header decoding; extends type coverage to dictionaries/extensions with skips where unsupported. |
| src/nanoarrow/ipc/decoder.c | Implements schema dictionary-encoding decoding + dictionary batch header decoding; explicitly rejects dictionary array decode. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
If there are no objections I'll merge sometime tomorrow and get started on the harder bits of actually reading dictionaries! |
To start #622, this PR adds read support for Schema messages that contain dictionary-encoded columns. I opened #844 and #845 to track the next steps.