Missing type gets lost when writing partitions of DataFrame #403

@svilupp

This is an odd one and likely to be a PICNIC...

Problem: Missingness in a string column is lost after saving and loading an Arrow file.

When it happens: A column in my dataset has type Union{Missing,String}, I partition the dataset, and the missing value appears only in a later partition (not in the first one). It's easily reproducible (see below).

Debugging:

  • It happens only with DataFrames (not with a Tables.rowtable created from a NamedTuple)
  • It happens only when partitioned as Iterators.partition(Tables.rows(df), 2). If partitioned as Iterators.partition(df, 2), available from version >1.5.0, it is fine
  • If a missing value appears in the first partition, it's fine
  • The validity bitmap is written correctly
  • But the field is marked as not nullable (!)

┌ Debug: building field: name = x1, nullable = false, T = String, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
--- in correct cases, this appears
┌ Debug: building field: name = x1, nullable = true, T = Union{Missing, String}, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
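
The symptoms above would be consistent with the column type being decided from the values of the first partition alone. That is only a guess and has not been confirmed against the Arrow.jl or Tables.jl source. The sketch below uses plain Julia and a hypothetical helper, `inferred_eltype` (not part of any package), to show how value-based inference per partition would produce this pattern.

```julia
# Hypothetical helper (NOT part of Arrow.jl or Tables.jl): infer an element
# type purely from the values present, ignoring the container's declared eltype.
inferred_eltype(values) = mapreduce(typeof, promote_type, values)

x1 = ["a", "b", missing, "c"]    # declared eltype: Union{Missing, String}

# Partitioning a Vector yields views that keep the declared eltype.
parts = collect(Iterators.partition(x1, 2))

for (i, part) in enumerate(parts)
    println("partition ", i,
            ": declared = ", eltype(part),
            ", inferred from values = ", inferred_eltype(part))
end
```

Under this assumption, the first partition holds only strings, so value-based inference yields String there, while the declared element type still carries Missing. A schema taken from such a first partition would mark x1 as non-nullable, which matches the debug output above.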

MWE

using Arrow, Tables, Random, DataFramesMeta
using Logging
# Debug-level logger; wrap a call as `with_logger(debuglogger) do ... end` to see the debug output
debuglogger = ConsoleLogger(stderr, Logging.Debug)

# Create dataset
fn = "test_types.arrow"
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame

# Works okay
Arrow.write(fn, df; compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}

# Works okay
Arrow.write(fn, Iterators.partition(df,2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:

# broken -- missingness is lost
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{String, Arrow.List{String, Int32, Vector{UInt8}}}

# Works okay with Tables
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
Arrow.write(fn, Iterators.partition(Tables.rows(t), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}

Versioninfo:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores

Arrow: 2.4.3 on main branch
