Commit 0d2aa43

chore(ci): use tpchgen-cli for generating the tpch dataset
closes #1120
1 parent 675e41e commit 0d2aa43

File tree

3 files changed: +4, -25 lines

.github/workflows/test.yml

Lines changed: 1 addition & 0 deletions
@@ -133,6 +133,7 @@ jobs:
       - name: Run dbgen to create 1 Gb dataset
         if: ${{ steps.cache-tpch-dataset.outputs.cache-hit != 'true' }}
         run: |
+          uv tool install tpchgen-cli
           cd benchmarks/tpch
           RUN_IN_CI=TRUE ./tpch-gen.sh 1

benchmarks/tpch/tpch-gen.sh

Lines changed: 2 additions & 12 deletions
@@ -34,21 +34,11 @@ fi
 #popd

 # Generate data into the ./data directory if it does not already exist
-FILE=./data/supplier.tbl
+FILE=./data/supplier.csv
 if test -f "$FILE"; then
   echo "$FILE exists."
 else
-  docker run -v `pwd`/data:/data $TERMINAL_FLAG --rm ghcr.io/scalytics/tpch-docker:main $VERBOSE_OUTPUT -s $1
-
-  # workaround for https://github.com/apache/arrow-datafusion/issues/6147
-  mv data/customer.tbl data/customer.csv
-  mv data/lineitem.tbl data/lineitem.csv
-  mv data/nation.tbl data/nation.csv
-  mv data/orders.tbl data/orders.csv
-  mv data/part.tbl data/part.csv
-  mv data/partsupp.tbl data/partsupp.csv
-  mv data/region.tbl data/region.csv
-  mv data/supplier.tbl data/supplier.csv
+  tpchgen-cli -s $1 --format=csv --output-dir=./data

   ls -l data
 fi
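The script above regenerates the dataset only when `./data/supplier.csv` is absent, using that file as a marker for the whole dataset. A minimal Python sketch of the same guard (the helper name `needs_generation` is hypothetical, not part of the commit):

```python
from pathlib import Path

# Hypothetical helper mirroring the shell guard in tpch-gen.sh:
# regenerate the TPC-H data only when the marker file is missing.
def needs_generation(data_dir: str = "./data") -> bool:
    marker = Path(data_dir) / "supplier.csv"
    return not marker.is_file()

# A directory that does not exist cannot contain the marker file,
# so generation is reported as needed.
print(needs_generation("./no-such-dir"))  # True
```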

examples/tpch/convert_data_to_parquet.py

Lines changed: 1 addition & 13 deletions
@@ -121,22 +121,10 @@
     # For convenience, go ahead and convert the schema column names to lowercase
     curr_schema = [(s[0].lower(), s[1]) for s in curr_schema_val]

-    # Pre-collect the output columns so we can ignore the null field we add
-    # in to handle the trailing | in the file
-    output_cols = [r[0] for r in curr_schema]
-
-    curr_schema = [pa.field(r[0], r[1], nullable=False) for r in curr_schema]
-
-    # Trailing | requires extra field for in processing
-    curr_schema.append(("some_null", pa.null()))
-
     schema = pa.schema(curr_schema)

     source_file = (curr_dir / f"../../benchmarks/tpch/data/(unknown).csv").resolve()
     dest_file = (curr_dir / f"./data/(unknown).parquet").resolve()

-    df = ctx.read_csv(source_file, schema=schema, has_header=False, delimiter="|")
-
-    df = df.select(*output_cols)
-
+    df = ctx.read_csv(source_file, schema=schema, has_header=True)
     df.write_parquet(dest_file, compression="snappy")
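The deleted lines existed because dbgen's `.tbl` rows end with a trailing `|`, which produces one spurious empty field per row when split on that delimiter; the old code absorbed it with an extra nullable column and dropped it afterwards. A small illustration of that parsing quirk (the sample row values are made up for the example):

```python
import csv
import io

# A dbgen-style .tbl row ends with a trailing '|'. Splitting on '|'
# therefore yields one extra empty field, which the removed code
# handled by appending a "some_null" column and selecting it away.
tbl_row = "1|Supplier#000000001|17|"
fields = next(csv.reader(io.StringIO(tbl_row), delimiter="|"))
print(fields)  # ['1', 'Supplier#000000001', '17', '']

# tpchgen-cli emits standard CSV with a header row and no trailing
# delimiter, so a plain read_csv(..., has_header=True) suffices.
```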
