@cbb330 cbb330 commented Jan 27, 2026

Summary

Part 1/15 of ORC predicate pushdown implementation.

Adds foundation for statistics-based predicate pushdown:

  • Add GetStripeStatistics() API to ORCFileReader
  • Forward declarations for liborc types
  • Infrastructure for accessing ORC stripe-level statistics

This is the first building block that enables stripe filtering based on column statistics.

Changes

  • adapter.h: Add GetStripeStatistics API
  • adapter.cc: Implement statistics retrieval

Rationale

ORC files store min/max statistics at the stripe level. This API exposes those statistics to enable predicate pushdown at the Dataset API layer, following the same pattern as Parquet row group statistics.
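
As a rough illustration of why stripe-level min/max values matter, here is a minimal sketch of skipping stripes for a predicate of the form `col > threshold`. It is not code from this PR: `StripeRange`, `MayContainGreaterThan`, and `SelectStripes` are hypothetical names, and the sketch assumes every stripe has valid INT64 statistics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-stripe summary for one INT64 column.
// This is an illustrative type, not an Arrow or liborc type.
struct StripeRange {
  int64_t min;
  int64_t max;
};

// A stripe can only contain a row with value > threshold if its max exceeds it.
inline bool MayContainGreaterThan(const StripeRange& stripe, int64_t threshold) {
  return stripe.max > threshold;
}

// Returns the indices of stripes that still have to be read for `col > threshold`.
// Stripes whose max is <= threshold are skipped without reading any rows.
inline std::vector<std::size_t> SelectStripes(const std::vector<StripeRange>& stripes,
                                              int64_t threshold) {
  std::vector<std::size_t> keep;
  for (std::size_t i = 0; i < stripes.size(); ++i) {
    if (MayContainGreaterThan(stripes[i], threshold)) {
      keep.push_back(i);
    }
  }
  return keep;
}
```

The check is conservative: a selected stripe may still contain no matching rows, so the predicate has to be re-applied to the rows that are actually read.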

Part of stacked PR series for ORC predicate pushdown.

Add internal utilities for extracting min/max statistics from ORC
stripe metadata. This establishes the foundation for statistics-based
stripe filtering in predicate pushdown.

Changes:
- Add MinMaxStats struct to hold extracted statistics
- Add ExtractStripeStatistics() function for INT64 columns
- Statistics extraction returns std::nullopt for missing/invalid data
- Validates statistics integrity (min <= max)
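
The behaviour listed above could look roughly like the following sketch. It is an assumption-laden illustration, not the code in this PR: `RawIntStats` is a hypothetical stand-in for whatever the ORC reader exposes, the fields of `MinMaxStats` are guessed, and `ExtractMinMax` is a made-up name rather than the actual `ExtractStripeStatistics()` signature.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the integer column statistics of one stripe.
// The real reader type and its accessors are not shown in this PR description.
struct RawIntStats {
  bool has_minimum;
  bool has_maximum;
  int64_t minimum;
  int64_t maximum;
};

// Extracted statistics; the field names are an assumption.
struct MinMaxStats {
  int64_t min;
  int64_t max;
};

// Returns std::nullopt when statistics are missing or inconsistent,
// so callers fall back to reading the stripe instead of filtering on bad data.
inline std::optional<MinMaxStats> ExtractMinMax(const RawIntStats& raw) {
  if (!raw.has_minimum || !raw.has_maximum) {
    return std::nullopt;  // missing statistics
  }
  if (raw.minimum > raw.maximum) {
    return std::nullopt;  // integrity check failed: min must be <= max
  }
  return MinMaxStats{raw.minimum, raw.maximum};
}
```

Returning an empty optional rather than an error keeps the failure mode safe: a stripe without usable statistics is simply never skipped.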

This is an internal-only change with no public API modifications.
Part of incremental ORC predicate pushdown implementation (PR1/15).

@cbb330 changed the title from "GH-48986: [C++][Dataset] Add ORC stripe statistics extraction foundation" to "GH-48986: [C++][Dataset] Add ORC stripe statistics extraction foundation (1/15)" on Jan 27, 2026
Member

@raulcd raulcd left a comment

Thanks for working on this. As a maintainer I find the 15 draft PRs slightly complex to navigate. Wouldn't it be better to submit them one by one?
In our current workflow we use one issue per PR, so it would be best if you could create sub-issues under the main issue.

@cbb330
Author

cbb330 commented Jan 28, 2026

Thanks for the feedback @raulcd, that makes sense!

My intent with the multiple draft PRs was to keep each change small, reviewable, and safely mergeable, but I can see how having them all open at once adds noise and makes the overall picture harder to follow.

I’m happy to adapt to the project’s workflow. I can:

  • Create sub-issues under the main issue to outline the overall plan, and attach a high-level implementation plan there, and
  • Close the remaining PRs (or keep them as local branches) and submit them one by one as we go.

Let me know if you have a preference for how you’d like me to proceed.

@raulcd
Member

raulcd commented Jan 28, 2026

Thanks for your answer. It is OK. I was just going over the latest PRs and found a bunch of draft ones; I wasn't sure whether they were ready for review, and I found that distracting. But again, thanks for your work on this!

When I've seen changes on the Arrow repository that required a multi-PR approach, they have usually followed these steps:

  • add a detailed plan to the main issue, where the plan is discussed and agreed (maybe reviewers prefer fewer steps?)
  • create a sub-issue for each mergeable change
  • create PRs following the plan

This might also help you with rebasing and fixing any CI issues that might appear.

@wgtmac is probably best suited to review these changes and discuss the feature (as a maintainer of both Arrow and ORC).
