When reading in CSV data, do not cut off any leading 0s #2786

Merged
fealho merged 7 commits into main from issue-2780-trailing-zeros
Jan 30, 2026

Member

@fealho fealho commented Jan 24, 2026

CU-86b85x0qf, Resolve #2780

codecov Bot commented Jan 24, 2026

Codecov Report

❌ Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.10%. Comparing base (c8238e9) to head (53ef480).
⚠️ Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
sdv/io/local/local.py | 90.00% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2786      +/-   ##
==========================================
- Coverage   98.13%   98.10%   -0.04%     
==========================================
  Files          73       73              
  Lines        8079     8107      +28     
==========================================
+ Hits         7928     7953      +25     
- Misses        151      154       +3     
Flag | Coverage Δ
integration | 75.57% <73.33%> (-0.02%) ⬇️
unit | 97.13% <90.00%> (-0.03%) ⬇️

@fealho fealho force-pushed the issue-2780-trailing-zeros branch 2 times, most recently from 0582027 to 144f73a Compare January 26, 2026 18:01
@fealho fealho force-pushed the issue-2780-trailing-zeros branch from 004f20d to df8dc4c Compare January 26, 2026 18:33
@fealho fealho marked this pull request as ready for review January 26, 2026 19:19
@fealho fealho requested a review from a team as a code owner January 26, 2026 19:19
Comment thread tests/unit/io/local/test_local.py
Comment thread tests/unit/io/local/test_local.py
@fealho fealho requested a review from gsheni January 26, 2026 23:46
Comment thread sdv/io/local/local.py
@fealho fealho requested a review from pvk-developer January 30, 2026 02:13
Member

@pvk-developer pvk-developer left a comment

You are still reading the file twice.

I think we can do something like this:

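import pandas as pd

# Read the file once, keeping every column as text so leading zeros survive.
df = pd.read_csv(path, dtype=str, low_memory=False)
# Flag columns where any value starts with "0" followed by more digits.
has_leading_zero = df.apply(
    lambda col: col.str.match(r"^0\d+", na=False).any()
)
for col in df.columns:
    if not has_leading_zero[col]:
        # Convert only when the whole column parses as numeric;
        # errors="ignore" is deprecated in recent pandas versions.
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # Not numeric: keep the column as strings.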

I tested this with 1M rows × 100 columns: it takes ~1 minute 10 seconds and covers all records, while your approach also took ~1 minute 10 seconds but covered only 100k rows per column, which did not work for some edge cases.
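
For a quick sanity check, the single-read idea can be tried on a tiny in-memory CSV. This is only a sketch: the helper name `read_csv_keep_leading_zeros` and the sample columns are hypothetical and not part of the PR, and it wraps `pd.to_numeric` in a plain try/except to decide whether a column is numeric.

```python
import io

import pandas as pd


def read_csv_keep_leading_zeros(source):
    """Hypothetical helper: read once as text, then convert safe columns."""
    # Single read: every column arrives as strings, so "007" stays "007".
    df = pd.read_csv(source, dtype=str, low_memory=False)
    for col in df.columns:
        # A value like "007" would lose information if parsed as a number,
        # so any column containing one is left as text.
        if df[col].str.match(r"0\d+", na=False).any():
            continue
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # Non-numeric text: leave the column as strings.
    return df


csv_text = "zip_code,age,name\n02139,31,Ana\n10001,45,Bo\n"
result = read_csv_keep_leading_zeros(io.StringIO(csv_text))
print(result["zip_code"].tolist())  # ['02139', '10001']
print(result["age"].tolist())       # [31, 45]
```

Only `zip_code` contains a zero-prefixed value, so it stays as text, while `age` is converted and `name` is left untouched.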

Here is the data I used:

import numpy as np
import pandas as pd

def generate_mixed_zero_df(
    n_rows=1_000_000,
    n_cols=100,
    width=8,
    last_n_padded=100_000,
    n_mixed_cols=10,
    seed=42
):
    rng = np.random.default_rng(seed)
    base_vals = np.arange(1, n_rows + 1) % (10**width)

    data = {}

    mixed_cols = set(rng.choice(n_cols, size=n_mixed_cols, replace=False))

    for i in range(n_cols):
        col_name = f"col_{i}"
        vals = base_vals.copy()
        if i in mixed_cols:
            first_part = np.char.ljust(
                vals[: n_rows - last_n_padded].astype(str),
                width,
                fillchar="0"
            )
            last_part = np.char.zfill(
                vals[n_rows - last_n_padded :].astype(str),
                width
            )

            data[col_name] = np.concatenate([first_part, last_part])
        else:
            data[col_name] = np.char.zfill(vals.astype(str), width)

    df = pd.DataFrame(data)

    return df, sorted(mixed_cols)


df, mixed_columns = generate_mixed_zero_df(
    n_mixed_cols=15
)

print("Mixed columns:", mixed_columns)
print(df.head())
print(df.tail())

Also, I think this approach could then be used in the download_demo functionality to preserve the leading zeros when loading the demo datasets.
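
The edge case behind the benchmark (zero-padded values that appear only in the last rows of a column) can be reproduced at a much smaller scale. The sketch below assumes a sample-based check that inspects only the first rows; the review does not say how the 100k rows were chosen, so that part is an illustration rather than the PR's actual logic.

```python
import pandas as pd

# Scaled-down edge case: the zero-padded values appear only at the end.
values = [str(i) for i in range(1, 901)]
values += [str(i).zfill(8) for i in range(901, 1001)]
col = pd.Series(values, dtype=object)

pattern = r"0\d+"  # str.match anchors at the start of each value

# Hypothetical sample-based check: looks at the first 100 rows only.
sample_flag = bool(col.head(100).str.match(pattern).any())
# Full-column check: looks at every row.
full_flag = bool(col.str.match(pattern).any())

print(sample_flag)  # False
print(full_flag)    # True
```

A check limited to the head of the column would treat it as plain integers and strip the padding from the last 100 values, while the full-column check keeps it as text.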

@fealho fealho merged commit 17ea19f into main Jan 30, 2026
47 checks passed
@fealho fealho deleted the issue-2780-trailing-zeros branch January 30, 2026 19:36

Development

Successfully merging this pull request may close these issues.

When reading in CSV data, do not cut off any leading 0s

4 participants