When reading in CSV data, do not cut off any leading 0s #2786
Codecov Report
❌ Patch coverage is
Additional details and impacted files

    @@            Coverage Diff             @@
    ##             main    #2786      +/-   ##
    ==========================================
    - Coverage   98.13%   98.10%   -0.04%
    ==========================================
      Files          73       73
      Lines        8079     8107      +28
    ==========================================
    + Hits         7928     7953      +25
    - Misses        151      154       +3
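You are still reading the file twice. I think we can do something like this:

    import pandas as pd

    # Read every column as a string so leading zeros survive parsing.
    df = pd.read_csv(path, dtype=str, low_memory=False)

    # Flag columns where at least one value starts with 0 followed by more digits.
    has_leading_zero = df.apply(lambda col: col.str.match(r"^0\d+", na=False).any())

    for col in df.columns:
        if not has_leading_zero[col]:
            # Convert back to numeric only if the whole column parses.
            # errors="ignore" is deprecated in recent pandas, so catch the failure instead.
            try:
                df[col] = pd.to_numeric(df[col])
            except (ValueError, TypeError):
                pass

I tested this with 1M rows × 100 columns: it takes ~1 min 10 s and covers all records, while your approach took about the same time (~1 min 10 s) but covered only 100k rows per column, which did not work for some edge cases.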
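For a quick check on a small input, the same idea can be wrapped in a helper. This is only a minimal sketch and not the PR's implementation: the function name `read_csv_keep_leading_zeros` and the sample CSV are made up for illustration, and it assumes a column should stay a string as soon as one value matches `^0\d+`.

```python
import io

import pandas as pd


def read_csv_keep_leading_zeros(source):
    """Hypothetical helper: read a CSV once, keeping zero-padded columns as strings."""
    df = pd.read_csv(source, dtype=str, low_memory=False)
    for col in df.columns:
        # A value such as "02134" would lose its padding if parsed as a number.
        if df[col].str.match(r"^0\d+", na=False).any():
            continue
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            # Not a numeric column (for example names); leave it as strings.
            pass
    return df


# Made-up sample data: one zero-padded column, one numeric, one text.
csv_text = "zip_code,amount,name\n02134,10,ann\n90210,25,bob\n"
result = read_csv_keep_leading_zeros(io.StringIO(csv_text))
print(result.dtypes)
```

Here `zip_code` should stay a string column because of `02134`, `amount` should become numeric, and `name` should remain text since it cannot be parsed as numbers.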
Here is the data I used:

    import numpy as np
    import pandas as pd


    def generate_mixed_zero_df(
        n_rows=1_000_000,
        n_cols=100,
        width=8,
        last_n_padded=100_000,
        n_mixed_cols=10,
        seed=42,
    ):
        rng = np.random.default_rng(seed)
        base_vals = np.arange(1, n_rows + 1) % (10**width)
        data = {}
        # Columns where only the last `last_n_padded` rows carry leading zeros.
        mixed_cols = set(rng.choice(n_cols, size=n_mixed_cols, replace=False))
        for i in range(n_cols):
            col_name = f"col_{i}"
            vals = base_vals.copy()
            if i in mixed_cols:
                # Right-pad with zeros: these values have no leading zeros.
                first_part = np.char.ljust(
                    vals[: n_rows - last_n_padded].astype(str),
                    width,
                    fillchar="0",
                )
                # Left-pad with zeros: these values do have leading zeros.
                last_part = np.char.zfill(
                    vals[n_rows - last_n_padded :].astype(str),
                    width,
                )
                data[col_name] = np.concatenate([first_part, last_part])
            else:
                data[col_name] = np.char.zfill(vals.astype(str), width)
        df = pd.DataFrame(data)
        return df, sorted(mixed_cols)


    df, mixed_columns = generate_mixed_zero_df(n_mixed_cols=15)
    print("Mixed columns:", mixed_columns)
    print(df.head())
    print(df.tail())

Also, I think this approach could then be used in the download_demo functionality to preserve the leading zeros in the demo datasets.
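The mixed columns in the generated data target the edge case where only the last rows are zero-padded, so a check limited to the first rows never sees them. A miniature version of that situation, with made-up values and a 5-row sample standing in for the 100k-row sample, might look like this:

```python
import pandas as pd

# Hypothetical miniature of a mixed column: only the last two rows are zero-padded.
values = pd.Series(["12345678"] * 8 + ["00000009", "00000010"])

pattern = r"^0\d+"
sample_size = 5  # stands in for a 100k-row sample

# Checking only the head of the column misses the padded rows at the end.
found_in_sample = values.head(sample_size).str.match(pattern).any()
# Checking every row finds them.
found_in_full = values.str.match(pattern).any()

print(found_in_sample, found_in_full)  # False True
```

A sampled check would treat this column as safe to convert to numbers and drop the padding in the last two rows, which is why the full-column check is preferable here.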
CU-86b85x0qf, Resolve #2780