When reading in CSV data, do not cut off any leading 0s by fealho · Pull Request #2786 · sdv-dev/SDV

fealho · 2026-01-24T03:08:38Z

CU-86b85x0qf, Resolve #2780

sdv-team · 2026-01-24T03:08:42Z

Task linked: CU-86b85x0qf SDV - When reading in CSV data, do not cut off any leading 0s #2780

codecov · 2026-01-24T03:12:07Z

Codecov Report

❌ Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.10%. Comparing base (c8238e9) to head (53ef480).
⚠️ Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
sdv/io/local/local.py	90.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2786      +/-   ##
==========================================
- Coverage   98.13%   98.10%   -0.04%     
==========================================
  Files          73       73              
  Lines        8079     8107      +28     
==========================================
+ Hits         7928     7953      +25     
- Misses        151      154       +3

Flag	Coverage Δ
integration	`75.57% <73.33%> (-0.02%)`	⬇️
unit	`97.13% <90.00%> (-0.03%)`	⬇️

pvk-developer

You are still reading the file twice.

I think we can do something like this:

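import pandas as pd

# Read every column as text so leading zeros survive the parse.
df = pd.read_csv(path, dtype=str, low_memory=False)
# Flag columns where any value looks like a zero-padded number, e.g. "007".
has_leading_zero = df.apply(
    lambda col: col.str.match(r"^0\d+", na=False).any()
)
for col in df.columns:
    if not has_leading_zero[col]:
        try:
            # errors="ignore" is deprecated since pandas 2.2, so catch the failure instead.
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # Not numeric: keep the column as strings.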

I tested this with 1M rows × 100 columns: it takes ~1 minute 10 seconds and covers all records, while your approach took about the same time (~1 minute 10 seconds) but checked only 100k rows per column, which fails on some edge cases.
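
For a quick check of the suggestion above, here is a minimal, self-contained sketch that wraps it in a helper and runs it on an in-memory CSV. The function name `read_csv_keep_leading_zeros` and the sample data are assumptions made for illustration; they are not part of this PR or of the SDV API.

```python
import io

import pandas as pd


def read_csv_keep_leading_zeros(source):
    """Hypothetical helper: read a CSV once, keeping zero-padded columns as text."""
    # Read everything as strings so nothing is parsed (and truncated) up front.
    df = pd.read_csv(source, dtype=str)
    for col in df.columns:
        # A value such as "007" marks the whole column as text.
        if df[col].str.match(r"^0\d+", na=False).any():
            continue
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # Non-numeric column: leave it as strings.
    return df


# Made-up sample data: one zero-padded column, one numeric, one text.
csv_text = "zip_code,age,name\n02139,31,Ann\n10001,45,Bob\n"
result = read_csv_keep_leading_zeros(io.StringIO(csv_text))
print(result["zip_code"].tolist())
print(result["age"].tolist())
```

The file is read only once; the cost is that every column is held as strings until the numeric ones are converted back.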

Here is the data I used:

import numpy as np
import pandas as pd

def generate_mixed_zero_df(
    n_rows=1_000_000,
    n_cols=100,
    width=8,
    last_n_padded=100_000,
    n_mixed_cols=10,
    seed=42
):
    rng = np.random.default_rng(seed)
    base_vals = np.arange(1, n_rows + 1) % (10**width)

    data = {}

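    # Pick the columns that will mix padded and unpadded values.
    mixed_cols = set(rng.choice(n_cols, size=n_mixed_cols, replace=False))

    for i in range(n_cols):
        col_name = f"col_{i}"
        vals = base_vals.copy()
        if i in mixed_cols:
            # Mixed column: the first rows are right-padded, so they have no leading zeros.
            first_part = np.char.ljust(
                vals[: n_rows - last_n_padded].astype(str),
                width,
                fillchar="0"
            )
            # Only the last `last_n_padded` rows carry leading zeros.
            last_part = np.char.zfill(
                vals[n_rows - last_n_padded :].astype(str),
                width
            )

            data[col_name] = np.concatenate([first_part, last_part])
        else:
            # Regular column: every row is left-padded with zeros.
            data[col_name] = np.char.zfill(vals.astype(str), width)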

    df = pd.DataFrame(data)

    return df, sorted(mixed_cols)


df, mixed_columns = generate_mixed_zero_df(
    n_mixed_cols=15
)

print("Mixed columns:", mixed_columns)
print(df.head())
print(df.tail())
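
The edge case this dataset targets can be reproduced at a much smaller scale. The sketch below is an illustration under stated assumptions, not code from the PR: the helper `has_leading_zero` and its `sample_rows` parameter are hypothetical names, standing in for a check that inspects only the first rows of a column versus one that scans every row.

```python
import pandas as pd


def has_leading_zero(column, sample_rows=None):
    """Hypothetical check: does any value look like a zero-padded number?

    If sample_rows is given, only the first sample_rows values are inspected.
    """
    values = column if sample_rows is None else column.head(sample_rows)
    return bool(values.str.match(r"^0\d+", na=False).any())


# Ten values: leading zeros appear only in the last two rows.
column = pd.Series(["10000000"] * 8 + ["00000009", "00000010"])

sampled = has_leading_zero(column, sample_rows=5)  # looks at the first 5 rows only
full = has_leading_zero(column)                    # scans the whole column
print(sampled, full)
```

A bounded check trades completeness for speed, so a column like this one would be converted to integers and lose its padding unless the padded rows happen to fall inside the sample.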

Also, I think this approach could then be reused in the download_demo functionality so that the demo datasets keep their leading zeros.

fealho force-pushed the issue-2780-trailing-zeros branch 2 times, most recently from 0582027 to 144f73a Compare January 26, 2026 18:01

Add keep_leading_zeros

df8dc4c

fealho force-pushed the issue-2780-trailing-zeros branch from 004f20d to df8dc4c Compare January 26, 2026 18:33

Merge branch 'main' into issue-2780-trailing-zeros

afda4c0

fealho requested review from gsheni and pvk-developer January 26, 2026 19:19

fealho marked this pull request as ready for review January 26, 2026 19:19

fealho requested a review from a team as a code owner January 26, 2026 19:19

auto-assign Bot assigned fealho Jan 26, 2026

gsheni reviewed Jan 26, 2026

Comment thread tests/unit/io/local/test_local.py

Comment thread tests/unit/io/local/test_local.py

Add ddtypes to tests

f69028b

fealho requested a review from gsheni January 26, 2026 23:46

gsheni approved these changes Jan 27, 2026

pvk-developer reviewed Jan 28, 2026

Comment thread sdv/io/local/local.py

fealho requested a review from pvk-developer January 30, 2026 02:13

Make it faster

cef4841

pvk-developer requested changes Jan 30, 2026

fealho added 3 commits January 30, 2026 07:59

Revert "Make it faster"

a252b35

This reverts commit cef4841.

Add bounding to initial commit

4c6d717

Merge branch 'main' into issue-2780-trailing-zeros

53ef480

pvk-developer approved these changes Jan 30, 2026

fealho merged commit 17ea19f into main Jan 30, 2026
47 checks passed

fealho deleted the issue-2780-trailing-zeros branch January 30, 2026 19:36

fealho merged 7 commits into main from issue-2780-trailing-zeros

4 participants
