When diagnosing problems in storage we often need to look at Parquet metadata. It would be much more productive to have a describe-parquet command to introspect files quickly. Example:
Thoughts on alternatives.
SELECT *
FROM s3('s3://<bucket>/nyc/taxis/data/1751985059861-5526da589bc474f953669f343505ec94f0074bfc701eb046184de439bfe0485f.parquet', ParquetMetaData)
Query id: 49c06e8f-45c4-4ff0-8587-95e7b8f8f8a5
Row 1:
──────
num_columns: 20
num_rows: 3475226 -- 3.48 million
num_row_groups: 1
format_version: 1.0
metadata_size: 4914
total_uncompressed_size: 94235387 -- 94.24 million
total_compressed_size: 56257069 -- 56.26 million
columns: [('VendorID','VendorID',1,0,'INT32','None','GZIP',815756,457781,'43.88%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tpep_pickup_datetime','tpep_pickup_datetime',1,0,'INT64','Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)','GZIP',27244805,13315093,'51.13%',['RLE','PLAIN','PLAIN_DICTIONARY','BIT_PACKED']),('tpep_dropoff_datetime','tpep_dropoff_datetime',1,0,'INT64','Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)','GZIP',27249800,13630784,'49.98%',['RLE','PLAIN','PLAIN_DICTIONARY','BIT_PACKED']),('passenger_count','passenger_count',1,0,'INT64','None','GZIP',1248716,565309,'54.73%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('trip_distance','trip_distance',1,0,'DOUBLE','None','GZIP',5458535,4879027,'10.62%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('RatecodeID','RatecodeID',1,0,'INT64','None','GZIP',543574,277677,'48.92%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('store_and_fwd_flag','store_and_fwd_flag',1,0,'BYTE_ARRAY','String','GZIP',35900,30404,'15.31%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('PULocationID','PULocationID',1,0,'INT32','None','GZIP',3559027,2657564,'25.33%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('DOLocationID','DOLocationID',1,0,'INT32','None','GZIP',3906660,3491779,'10.62%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('payment_type','payment_type',1,0,'INT64','None','GZIP',910126,459559,'49.51%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('fare_amount','fare_amount',1,0,'DOUBLE','None','GZIP',5325026,3884220,'27.06%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('extra','extra',1,0,'DOUBLE','None','GZIP',2071927,893528,'56.87%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('mta_tax','mta_tax',1,0,'DOUBLE','None','GZIP',371194,177152,'52.28%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tip_amount','tip_amount',1,0,'DOUBLE','None','GZIP',4929627,3756133,'23.8%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('tolls_amount','tolls_amount',1,0,'DOUBLE','None','GZIP',1825209,405014,'77.81%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('improvement_surcharge','improvement_surcharge',1,0,'DOUBLE','None','GZIP',298522,160411,'46.26%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('total_amount','total_amount',1,0,'DOUBLE','None','GZIP',6513089,6185099,'5.036%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('congestion_surcharge','congestion_surcharge',1,0,'DOUBLE','None','GZIP',556454,303021,'45.54%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('Airport_fee','Airport_fee',1,0,'DOUBLE','None','GZIP',594481,272186,'54.21%',['RLE','PLAIN_DICTIONARY','BIT_PACKED']),('cbd_congestion_fee','cbd_congestion_fee',1,0,'DOUBLE','None','GZIP',776959,455328,'41.4%',['RLE','PLAIN_DICTIONARY','BIT_PACKED'])]
row_groups: [(4,20,3475226,94235387,56257069,[('VendorID','VendorID',457781,815756,true,(3475226,0,NULL,'1','7'),0),('tpep_pickup_datetime','tpep_pickup_datetime',13315093,27244805,true,(3475226,0,NULL,'1735678075000000','1738368044000000'),0),('tpep_dropoff_datetime','tpep_dropoff_datetime',13630784,27249800,true,(3475226,0,NULL,'1734508360000000','1738453451000000'),0),('passenger_count','passenger_count',565309,1248716,true,(2935077,540149,NULL,'0','9'),0),('trip_distance','trip_distance',4879027,5458535,true,(3475226,0,NULL,'0.000000','276423.570000'),0),('RatecodeID','RatecodeID',277677,543574,true,(2935077,540149,NULL,'1','99'),0),('store_and_fwd_flag','store_and_fwd_flag',30404,35900,true,(2935077,540149,NULL,'N','Y'),0),('PULocationID','PULocationID',2657564,3559027,true,(3475226,0,NULL,'1','265'),0),('DOLocationID','DOLocationID',3491779,3906660,true,(3475226,0,NULL,'1','265'),0),('payment_type','payment_type',459559,910126,true,(3475226,0,NULL,'0','5'),0),('fare_amount','fare_amount',3884220,5325026,true,(3475226,0,NULL,'-900.000000','863372.120000'),0),('extra','extra',893528,2071927,true,(3475226,0,NULL,'-7.500000','15.000000'),0),('mta_tax','mta_tax',177152,371194,true,(3475226,0,NULL,'-0.500000','10.500000'),0),('tip_amount','tip_amount',3756133,4929627,true,(3475226,0,NULL,'-86.000000','400.000000'),0),('tolls_amount','tolls_amount',405014,1825209,true,(3475226,0,NULL,'-126.940000','170.940000'),0),('improvement_surcharge','improvement_surcharge',160411,298522,true,(3475226,0,NULL,'-1.000000','1.000000'),0),('total_amount','total_amount',6185099,6513089,true,(3475226,0,NULL,'-901.000000','863380.370000'),0),('congestion_surcharge','congestion_surcharge',303021,556454,true,(2935077,540149,NULL,'-2.500000','2.500000'),0),('Airport_fee','Airport_fee',272186,594481,true,(2935077,540149,NULL,'-1.750000','6.750000'),0),('cbd_congestion_fee','cbd_congestion_fee',455328,776959,true,(3475226,0,NULL,'-0.750000','0.750000'),0)])]
1 row in set. Elapsed: 0.117 sec.
When diagnosing problems in storage we often need to look at Parquet metadata. It would be much more productive to have a describe-parquet command to introspect files quickly. Example:
Thoughts on alternatives.
This command could alternatively be a wrapper around parquet-tools. The
parquet-tools metaoutput is not bad but you have to run it through inconvenient jq commands to get readable output.ClickHouse has pretty good Parquet introspection but it's hard to read. The ice command would have the same information but obtained directly from the file without needing to use ClickHouse. Here's the ClickHouse example: