[improvement](be) Add disk cache for external file metadata #63376
xylaaaaa wants to merge 3 commits
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary: FileMetaCache only cached parsed Parquet/ORC footer metadata in memory, so memory eviction caused repeated remote footer reads from S3/HDFS. This adds a META queue in block file cache and uses it as a disk-backed L2 for Parquet/ORC metadata, with separate memory and disk hit profile counters.
### Release note
Add optional disk-backed external file metadata cache for Parquet/ORC readers.
### Check List (For Author)
- Test: Unit Test
- Added BE UT coverage for META queue eviction priority, read-only file cache lookup, disk file metadata cache read/write, and cache path meta percent parsing.
- Verified changed BE sources by focused C++ compilation from compile_commands.json; full BE UT binary was not run because this workspace would first rebuild/link the large OpenBLAS/test target.
- Behavior changed: Yes. When enable_external_file_meta_disk_cache is true, Parquet/ORC footer metadata can be read from local file cache META queue instead of remote storage.
- Does this need documentation: Yes. BE config documentation should describe enable_external_file_meta_disk_cache and meta_percent.
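The lookup order described in the problem summary (memory first, then the disk-backed META queue, then remote storage) can be sketched roughly as below. This is only an illustration under stated assumptions: `TwoTierFooterCache`, `FooterProfile`, and the counter names are hypothetical, and both tiers are modeled with in-process maps rather than the real `FileMetaCache` or block file cache.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical counters; the real profile counter names may differ.
struct FooterProfile {
    int64_t hit_memory_cache = 0;
    int64_t hit_disk_cache = 0;
    int64_t remote_reads = 0;
};

// Hypothetical two-tier footer cache. std::unordered_map stands in for
// both the in-memory cache (L1) and the disk-backed META queue (L2).
class TwoTierFooterCache {
public:
    using RemoteReader = std::function<std::string(const std::string&)>;

    TwoTierFooterCache(bool disk_enabled, RemoteReader reader)
            : _disk_enabled(disk_enabled), _reader(std::move(reader)) {}

    std::string get(const std::string& key, FooterProfile* profile) {
        // L1: in-memory footer metadata.
        if (auto it = _memory.find(key); it != _memory.end()) {
            ++profile->hit_memory_cache;
            return it->second;
        }
        // L2: disk-backed copy, consulted only when the feature is enabled.
        if (_disk_enabled) {
            if (auto it = _disk.find(key); it != _disk.end()) {
                ++profile->hit_disk_cache;
                _memory[key] = it->second; // promote back into L1
                return it->second;
            }
        }
        // Miss in both tiers: read the footer remotely and populate the caches.
        ++profile->remote_reads;
        std::string footer = _reader(key);
        if (_disk_enabled) {
            _disk[key] = footer;
        }
        _memory[key] = footer;
        return footer;
    }

    // Simulates memory eviction or a BE restart: L1 is lost, L2 survives.
    void drop_memory() { _memory.clear(); }

private:
    bool _disk_enabled;
    RemoteReader _reader;
    std::unordered_map<std::string, std::string> _memory;
    std::unordered_map<std::string, std::string> _disk;
};
```

In this model, a cold read bumps `remote_reads`, a repeat read bumps `hit_memory_cache`, and a read after `drop_memory()` bumps `hit_disk_cache` without another remote read.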
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary: External file metadata disk cache should be usable without enabling the data file cache. This keeps FileCacheFactory initialization available when only enable_external_file_meta_disk_cache is enabled and adds unit coverage for that configuration.
### Release note
Allow external file metadata disk cache to be enabled independently from data file cache.
### Check List (For Author)
- Test: Unit Test / Manual test
- Added BE UT coverage for initializing metadata disk cache when enable_file_cache is false.
- Manually verified Parquet and ORC local TVF cold, memory-hit, and restart disk-hit profiles.
- Behavior changed: Yes. enable_external_file_meta_disk_cache can initialize metadata disk cache even when enable_file_cache is false.
- Does this need documentation: Yes. BE config documentation should describe enable_external_file_meta_disk_cache and meta_percent.
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary: Fix review findings for external file metadata disk cache by protecting ORC footer ownership with unique_ptr, applying global meta percent to legacy four-field file_cache_path configs, clarifying read_if_cached lifetime safety, and aligning file cache evict metric labels with FileCacheType indexes.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Added BE UT coverage for legacy four-field file_cache_path configs with enable_external_file_meta_disk_cache.
- Ran build-support/check-format.sh.
- Ran git diff --check.
- Tried ./run-be-ut.sh --run --filter=BlockFileCacheTest.file_cache_path_storage_parse -j 8, but local link failed before test execution because thirdparty/installed/lib64/libaws-cpp-sdk-kinesis.a is missing.
- Behavior changed: Yes. Legacy four-field file_cache_path configs now reserve external metadata disk cache percent when enable_external_file_meta_disk_cache is true.
- Does this need documentation: No
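One possible way to reserve a metadata percentage out of a legacy four-field split is sketched below. This is an assumption for illustration only: the commit message does not say whether the existing percentages are scaled proportionally or adjusted some other way, and `QueuePercents` and `reserve_meta_percent` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Hypothetical result type: four legacy queue percentages plus the META share.
struct QueuePercents {
    std::array<int, 4> legacy;
    int meta;
};

// Assumed policy: shrink the four legacy percentages proportionally so that
// legacy + meta still sums to 100. The real BE logic may differ.
QueuePercents reserve_meta_percent(const std::array<int, 4>& legacy, int meta_percent,
                                   bool meta_enabled) {
    int sum = 0;
    for (int p : legacy) {
        sum += p;
    }
    if (sum != 100) {
        throw std::invalid_argument("legacy percentages must sum to 100");
    }
    if (!meta_enabled) {
        return {legacy, 0}; // feature off: legacy split is left untouched
    }
    if (meta_percent < 0 || meta_percent >= 100) {
        throw std::invalid_argument("meta percent must be in [0, 100)");
    }

    const int remaining = 100 - meta_percent;
    std::array<int, 4> scaled{};
    int assigned = 0;
    std::size_t largest = 0;
    for (std::size_t i = 0; i < legacy.size(); ++i) {
        scaled[i] = legacy[i] * remaining / 100; // integer division rounds down
        assigned += scaled[i];
        if (legacy[i] > legacy[largest]) {
            largest = i;
        }
    }
    // Hand the rounding leftover to the largest queue so the total stays exact.
    scaled[largest] += remaining - assigned;
    return {scaled, meta_percent};
}
```

Giving the rounding leftover to the largest queue is an arbitrary choice in this sketch; it simply keeps the total at exactly 100.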
Addressed the actionable review feedback.
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary:
External Parquet/ORC footer metadata was only cached in BE memory. After memory cache eviction or BE restart, repeated reads from remote storage such as S3/HDFS had to fetch footer metadata remotely again. This PR adds a disk-backed L2 cache for external file metadata by reusing the file cache META queue, and wires it into Parquet/ORC readers with separate profile counters for memory and disk hits.
Main changes:
- `enable_external_file_meta_disk_cache` and `external_file_meta_disk_cache_percent` BE configs.
- `FileMetaDiskCache` serialization and validation for Parquet/ORC footer payloads.
### Release note
Add optional disk-backed external file metadata cache for Parquet/ORC readers.
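The serialization and validation of footer payloads mentioned in the main changes could look roughly like the sketch below. The on-disk layout here (magic, format byte, payload length, checksum, payload) is an assumption made for illustration, not the layout `FileMetaDiskCache` actually uses, and FNV-1a is only a stand-in for whatever checksum the real code applies.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

enum class FooterFormat : uint8_t { PARQUET = 1, ORC = 2 };

constexpr uint32_t kMagic = 0x4D455441;                // hypothetical magic value
constexpr std::size_t kHeaderSize = 4 + 1 + 8 + 4;     // magic, format, length, checksum

// FNV-1a, used here only as a simple stand-in checksum.
uint32_t fnv1a(const std::string& data) {
    uint32_t h = 2166136261u;
    for (unsigned char c : data) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Appends an unsigned integer in little-endian byte order.
template <typename T>
void append_le(std::string* out, T v) {
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
    }
}

// Reads a little-endian unsigned integer starting at `offset`.
template <typename T>
T read_le(const std::string& in, std::size_t offset) {
    T v = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        v |= static_cast<T>(static_cast<unsigned char>(in[offset + i])) << (8 * i);
    }
    return v;
}

std::string serialize_footer(FooterFormat format, const std::string& payload) {
    std::string out;
    out.reserve(kHeaderSize + payload.size());
    append_le<uint32_t>(&out, kMagic);
    out.push_back(static_cast<char>(format));
    append_le<uint64_t>(&out, payload.size());
    append_le<uint32_t>(&out, fnv1a(payload));
    out += payload;
    return out;
}

// Returns the payload only if the magic, format, length, and checksum all match.
std::optional<std::string> deserialize_footer(const std::string& bytes, FooterFormat expected) {
    if (bytes.size() < kHeaderSize) return std::nullopt;
    if (read_le<uint32_t>(bytes, 0) != kMagic) return std::nullopt;
    if (static_cast<uint8_t>(bytes[4]) != static_cast<uint8_t>(expected)) return std::nullopt;
    if (read_le<uint64_t>(bytes, 5) != bytes.size() - kHeaderSize) return std::nullopt;
    std::string payload = bytes.substr(kHeaderSize);
    if (read_le<uint32_t>(bytes, 13) != fnv1a(payload)) return std::nullopt;
    return payload;
}
```

Treating any validation failure as a plain cache miss lets the reader fall back to the remote footer read instead of failing the query.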
### Check List (For Author)
- `PATH=/mnt/disk1/chenjunwei/doris_tools/ldb_toolchain_llvm16/bin:$PATH CLANG_FORMAT_BINARY=/mnt/disk1/chenjunwei/doris_tools/ldb_toolchain_llvm16/bin/clang-format build-support/check-format.sh` passed.
- `git diff --check origin/master...HEAD` passed.
- Added BE UT coverage for initializing the metadata disk cache with `enable_file_cache=false`.
- Tried `./run-be-ut.sh --run --filter='FileMetaCacheTest.*:FileMetaDiskCacheTest.*:BlockFileCacheTest.file_cache_path_storage_parse:BlockFileCacheTest.file_meta_disk_cache_initializes_without_data_file_cache:BlockFileCacheTest.meta_queue_can_evict_data_cache_first:BlockFileCacheTest.normal_queue_does_not_evict_meta_cache:BlockFileCacheTest.read_if_cached_returns_downloaded_meta_block' -j 8`; the local worktree failed before running tests because `thirdparty/installed/lib64/libaws-cpp-sdk-kinesis.a` is missing for current master's BE UT link target.
- Manually verified with `enable_file_cache=false` and `enable_external_file_meta_disk_cache=true`: the cold query wrote the disk cache, the same-process hot query hit the memory cache, and the hot query after a BE restart hit the disk cache (`FileFooterHitDiskCache=1`, `FileFooterReadCalls=0`).
- (`FileFooterHitDiskCache=1`, `FileFooterReadCalls=0`).
- Behavior changed: Yes. When `enable_external_file_meta_disk_cache` is true, external Parquet/ORC footer metadata can persist in the local file cache and be reused after memory eviction or BE restart; this can be enabled independently from the data file cache.
- Does this need documentation: Yes. BE config documentation should describe `enable_external_file_meta_disk_cache` and `external_file_meta_disk_cache_percent`.
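The eviction priority suggested by the test names `meta_queue_can_evict_data_cache_first` and `normal_queue_does_not_evict_meta_cache` can be modeled with a toy cache like the one below. The policy here is inferred from those names only, and `TinyFileCache`, `QueueType`, and `Block` are hypothetical types, not the real block file cache classes.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

enum class QueueType { NORMAL, META };

struct Block {
    std::string key;
    std::size_t size;
};

// Toy model: one shared capacity, two FIFO queues.
class TinyFileCache {
public:
    explicit TinyFileCache(std::size_t capacity) : _capacity(capacity) {}

    // Returns false when space cannot be freed under the eviction rules.
    bool insert(QueueType type, Block block) {
        if (block.size > _capacity) return false;
        while (_used + block.size > _capacity) {
            if (!evict_one_for(type)) return false;
        }
        _used += block.size;
        (type == QueueType::META ? _meta : _normal).push_back(std::move(block));
        return true;
    }

    std::size_t count(QueueType type) const {
        return type == QueueType::META ? _meta.size() : _normal.size();
    }

private:
    bool evict_one_for(QueueType incoming) {
        // Data blocks are always the first eviction candidates.
        if (pop_oldest(_normal)) return true;
        // Only META inserts may fall back to evicting META blocks.
        return incoming == QueueType::META && pop_oldest(_meta);
    }

    bool pop_oldest(std::deque<Block>& q) {
        if (q.empty()) return false;
        _used -= q.front().size;
        q.pop_front();
        return true;
    }

    std::size_t _capacity;
    std::size_t _used = 0;
    std::deque<Block> _normal;
    std::deque<Block> _meta;
};
```

Under this assumed policy, small footer blocks are not pushed out by large data blocks, which matches the goal of keeping footers available after memory eviction or a BE restart.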