Skip to content

Conversation

@shangxinli
Copy link
Contributor

Replace GenericDatum intermediate layer with direct Avro decoder access to improve manifest I/O performance.

Changes:

  • Add avro_direct_decoder_internal.h with DecodeAvroToBuilder API
  • Add avro_direct_decoder.cc implementing direct Avro→Arrow decoding
    • Primitive types: bool, int, long, float, double, string, binary, fixed
    • Temporal types: date, time, timestamp
    • Logical types: uuid, decimal
    • Nested types: struct, list, map
    • Union type handling for nullable fields
  • Modify avro_reader.cc to use DataFileReaderBase with direct decoder
    • Replace DataFileReader with DataFileReaderBase
    • Use decoder.decodeInt(), decodeLong(), etc. directly
    • Remove GenericDatum allocation and extraction overhead
  • Update CMakeLists.txt to include new decoder source

Performance improvement:

  • Before: Avro binary → GenericDatum → Extract → Arrow (3 steps)
  • After: Avro binary → decoder.decodeInt() → Arrow (2 steps)

This matches Java implementation which uses Decoder directly via ValueReader interface, avoiding intermediate object allocation.

All avro_test cases pass.

Issue: #332

@shangxinli shangxinli force-pushed the avro_reader branch 7 times, most recently from e7b55b2 to e54929d Compare December 1, 2025 00:56
ICEBERG_RETURN_UNEXPECTED(
AppendDatumToBuilder(reader_->readerSchema().root(), *context_->datum_,
projection_, *read_schema_, context_->builder_.get()));
DecodeAvroToBuilder(reader_->readerSchema().root(), reader_->decoder(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to use a feature flag to enable users to use the old reader just in case there is any bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion!

@shangxinli shangxinli force-pushed the avro_reader branch 2 times, most recently from 2334ea3 to d16b0f1 Compare December 1, 2025 03:17
@wgtmac
Copy link
Member

wgtmac commented Dec 2, 2025

Could you help add some test cases? The original avro reader is simply a wrapper on top of avro-cpp so existing test cases are very minimal. We need better test coverage to be confident. Perhaps we can use the avro writer to create in-memory avro files (using MakeMockFileIO and then use the new reader to read them. It is also nice to add some benchmarks to prove that this indeed improves the reading performance compared to the original implementation.

@shangxinli shangxinli force-pushed the avro_reader branch 3 times, most recently from eae860e to 4ed80b0 Compare December 3, 2025 23:49
Status SkipAvroValue(const ::avro::NodePtr& avro_node, ::avro::Decoder& decoder) {
switch (avro_node->type()) {
case ::avro::AVRO_NULL:
// Nothing to skip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to call decoder.decodeNull()? Even if nothing todo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

ToString(avro_node));
}
auto* builder = internal::checked_cast<::arrow::BinaryBuilder*>(array_builder);
std::vector<uint8_t> bytes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to avoid frequent small object allocation like this? Perhaps we can reuse it by adding a class DecodeContext where a std::vector<uint8_t> scratch object is supposed to be used from it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for other temp vector and string variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense

return InvalidArgument("Expected Avro fixed for decimal field, got: {}",
ToString(avro_node));
}
if (avro_node->logicalType().type() != ::avro::LogicalType::DECIMAL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we combine the two ifs above just like other branches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

// Handle null fields (fields in projection but not in Avro)
for (size_t proj_idx = 0; proj_idx < projections.size(); ++proj_idx) {
const auto& field_projection = projections[proj_idx];
if (field_projection.kind == FieldProjection::Kind::kNull) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return error for other unsupported kind?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure


/// \brief Use direct Avro decoder (true) or GenericDatum-based decoder (false).
/// Default: true (use direct decoder for better performance).
inline static Entry<bool> kAvroUseDirectDecoder{"avro.use-direct-decoder", true};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call it read.avro.skip-datum, WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! I've renamed it to follow the naming convention and be more descriptive.

Changes:

  • kAvroUseDirectDecoder → kAvroSkipDatum
  • "avro.use-direct-decoder" → "read.avro.skip-datum"

WriteAndVerify(schema, expected_string);
}

TEST_F(AvroReaderTest, MapType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test a map with a non-string-typed key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've added a test for maps with non-string keys.

}

// Test both direct decoder and GenericDatum paths
TEST_F(AvroReaderTest, DirectDecoderVsGenericDatum) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend changing it to parameterized test to enable/disable direct decoder for all cases in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted the Avro tests to use parameterized tests as recommended. Created AvroReaderParameterizedTest fixture and Converted 12 tests to TEST_P. Added INSTANTIATE_TEST_SUITE_P and Removed DirectDecoderVsGenericDatum test.

}
}

TEST_F(AvroReaderTest, LargeDataset) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we don't have test for projection. For example, only some columns are selected and they are reordered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've added a test for column projection with subset selection and reordering.

@wgtmac
Copy link
Member

wgtmac commented Dec 11, 2025

BTW, it is possible to add a executable under iceberg/avro folder named avro_scan? We can use it to benchmark on a local avro file and be able to config whether to skip datum or not.

@shangxinli
Copy link
Contributor Author

BTW, it is possible to add a executable under iceberg/avro folder named avro_scan? We can use it to benchmark on a local avro file and be able to config whether to skip datum or not.

Just created the avro_scan benchmark executable in src/iceberg/avro/.

Features:

  • Configurable decoder: --skip-datum=true (direct decoder) or false (GenericDatum)
  • Batch size control: --batch-size=N to configure batch sizes
  • Performance metrics: Reports total rows, time, and throughput (rows/sec and MB/sec)
  • Schema inspection: Displays the schema of the Avro file

Usage:

Build

cmake --build build --target avro_scan

Run with direct decoder (default)

./build/src/iceberg/avro/avro_scan data.avro

Run with GenericDatum decoder

./build/src/iceberg/avro/avro_scan --skip-datum=false data.avro

Custom batch size

./build/src/iceberg/avro/avro_scan --batch-size=1000 --skip-datum=true data.avro

Help

./build/src/iceberg/avro/avro_scan --help

Output example:
Scanning Avro file: data.avro
Skip datum: true
Batch size: 4096

File size: 12345678 bytes
Schema: struct<id: int64, name: string, value: double>

Results:
Total rows: 1000000
Batches: 245
Time: 1250 ms
Throughput: 800000 rows/sec
Throughput: 9.45 MB/sec

shangxinli and others added 5 commits December 15, 2025 16:48
Replace GenericDatum intermediate layer with direct Avro decoder access
to improve manifest I/O performance.

Changes:
- Add avro_direct_decoder_internal.h with DecodeAvroToBuilder API
- Add avro_direct_decoder.cc implementing direct Avro→Arrow decoding
  - Primitive types: bool, int, long, float, double, string, binary, fixed
  - Temporal types: date, time, timestamp
  - Logical types: uuid, decimal (with validation)
  - Nested types: struct, list, map
  - Union type handling with bounds checking
  - Field skipping with proper multi-block handling for arrays/maps
- Modify avro_reader.cc to use DataFileReaderBase with direct decoder
  - Replace DataFileReader<GenericDatum> with DataFileReaderBase
  - Use decoder.decodeInt(), decodeLong(), etc. directly
  - Remove GenericDatum allocation and extraction overhead
- Update CMakeLists.txt to include new decoder source

Validation added:
- Union branch bounds checking
- Decimal byte width validation (uses schema fixedSize, not calculated)
- Decimal precision sufficiency validation
- Logical type presence validation
- Type mismatch error handling

Documentation:
- Comprehensive API documentation in header
- Schema evolution handling via SchemaProjection explained
- Error handling behavior documented
- Limitations noted (default values not supported)

Performance improvement:
- Before: Avro binary → GenericDatum → Extract → Arrow (3 steps)
- After: Avro binary → decoder.decodeInt() → Arrow (2 steps)

This matches Java implementation which uses Decoder directly via
ValueReader interface, avoiding intermediate object allocation.

All 173 avro_test cases pass.

Issue: apache#332
Add extensive test coverage to validate the direct decoder implementation:
- All primitive types (boolean, int, long, float, double, string, binary)
- Temporal types (date, time, timestamp)
- Complex nested structures (nested structs, lists, maps)
- Null handling and optional fields
- Large datasets (1000+ rows)
- Direct decoder vs GenericDatum comparison tests

Add benchmark tool to measure performance improvements:
- Benchmarks with various data patterns (primitives, nested, lists, nulls)
- Compares direct decoder vs GenericDatum performance
- Expected speedup: 1.5x - 2.5x due to eliminated intermediate copies

Add feature flag for direct Avro decoder:
- ReaderProperties::kAvroUseDirectDecoder (default: true)
- Allows fallback to GenericDatum implementation if issues arise
- Dual-path implementation with helper functions to reduce code duplication

Test results:
- 16 comprehensive Avro reader tests (vs 5 before)
- 180 total tests in avro_test suite
- 100% passing rate

This addresses review feedback from wgtmac to provide better test coverage
and prove performance improvements of the direct decoder implementation.
Address reviewer comment to explicitly call decoder.decodeNull() even
though AVRO_NULL has no data to skip. This is more consistent with
other type handlers and makes the decoder state handling explicit.
- Replace 'new/legacy path' terminology with 'DirectDecoder/GenericDatum'
- Add DecodeContext to reuse scratch buffers and avoid allocations
- Combine decimal type validation checks
- Add error handling for unsupported FieldProjection kinds
- Cache avro_to_projection mapping in DecodeContext
- Rename kAvroUseDirectDecoder to kAvroSkipDatum
- Add test for map with non-string keys
- Add test for column projection with subset and reordering
- Create avro_scan benchmark executable
- Convert tests to parameterized tests for both decoder modes
- Fix temp file path handling for parameterized test names
@wgtmac
Copy link
Member

wgtmac commented Dec 15, 2025

Thanks for working on this @shangxinli!

@wgtmac wgtmac merged commit 1c67e4c into apache:main Dec 15, 2025
10 checks passed
@wgtmac wgtmac mentioned this pull request Dec 19, 2025
2 tasks
shangxinli added a commit to shangxinli/iceberg-cpp that referenced this pull request Dec 26, 2025
Remove unnecessary defensive checks for ListType/MapType in AVRO_RECORD case.
These types are encoded as AVRO_ARRAY and AVRO_MAP respectively, so the
defensive checks will never trigger in normal usage.

This follows PR apache#374 feedback: "I don't think we need this check."

Changes:
- Simplify type checking to only validate StructType
- Remove unreachable ListType/MapType error paths
- Add clarifying comment about type mapping
- Reduce code complexity from 17 lines to 8 lines

All 185 tests passing
shangxinli added a commit to shangxinli/iceberg-cpp that referenced this pull request Dec 26, 2025
Implement direct Avro encoder to eliminate GenericDatum intermediate layer,
matching the approach used in the direct decoder (PR apache#374).

Changes:
- Complete AVRO_ARRAY encoding for both ListType and MapType (non-string keys)
- Complete AVRO_MAP encoding for maps with string keys
- Add temporal type support (Date32Array, Time64Array, TimestampArray)
- Change signature from Schema to Type for proper nested type handling
- Fix union branch encoding and map value node access
- Add EncodeContext to reuse scratch buffers and avoid allocations
- Simplify AVRO_RECORD type checking, remove unnecessary defensive checks

Testing:
- Add 8 comprehensive writer tests (WritePrimitiveTypes, WriteTemporalTypes,
  WriteNestedStruct, WriteListType, WriteMapTypeWithStringKey,
  WriteMapTypeWithNonStringKey, WriteOptionalFields, WriteLargeDataset)
- Tests verify encoder output directly using Avro GenericDatum reader
- Total: 185 tests, all passing (up from 177)

Performance improvement:
- Before: Arrow → GenericDatum → Avro binary (3 steps)
- After: Arrow → encoder.encodeInt() → Avro binary (2 steps)

This matches Java Iceberg implementation and eliminates intermediate
object allocation overhead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants