Support for :disable-na-as-missing? with :parser-fn option #460

@stevenproctor

We found PR #399, which adds support for the :disable-na-as-missing? option, but when we do not want promotion and fix the column type instead, the option has no effect: the FixedTypeParser does not use it.

We use tech.ml.dataset to read CSV files and run validation on the values, so we want to keep everything as strings and defer promotion to a later point. For that reason we have been using the :parser-fn :string option.

After finding that :disable-na-as-missing? exists for the promotional parser, we were hoping to use it here as well, so that NA values are left untouched.

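The sample below shows the difference between the :parser-type and :parser-fn option keys when both have the value :string. In the :parser-type call, column a comes back as :int16, so promotion is evidently still happening there, and that is the only call in which NA is kept.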

user=>        (spit "/tmp/foo.csv" "a,b\n9,1\n8,2\n7,3\n6,NA\n5,5")
nil
user=>        (ds/->dataset "/tmp/foo.csv" {:parser-fn :string})
/tmp/foo.csv [5 2]:

| a | b |
|---|---|
| 9 | 1 |
| 8 | 2 |
| 7 | 3 |
| 6 |   |
| 5 | 5 |

user=>        (map meta (vals *1))
({:categorical? true, :name "a", :datatype :string, :n-elems 5} {:categorical? true, :name "b", :datatype :string, :n-elems 5})
user=>        (ds/missing *2)
{3}
user=>        (ds/->dataset "/tmp/foo.csv" {:parser-type :string :disable-na-as-missing? true})
/tmp/foo.csv [5 2]:

| a |  b |
|--:|----|
| 9 |  1 |
| 8 |  2 |
| 7 |  3 |
| 6 | NA |
| 5 |  5 |

user=>        (map meta (vals *1))
({:name "a", :datatype :int16, :n-elems 5} {:categorical? true, :name "b", :datatype :string, :n-elems 5})
user=>        (ds/missing *2)
{}
user=>        (ds/->dataset "/tmp/foo.csv" {:parser-fn :string :disable-na-as-missing? true})
/tmp/foo.csv [5 2]:

| a | b |
|---|---|
| 9 | 1 |
| 8 | 2 |
| 7 | 3 |
| 6 |   |
| 5 | 5 |

user=>        (map meta (vals *1))
({:categorical? true, :name "a", :datatype :string, :n-elems 5} {:categorical? true, :name "b", :datatype :string, :n-elems 5})
user=>        (ds/missing *2)
{3}

Looking at the PR linked above, the :disable-na-as-missing? option is checked for the Promotional and Object types but not for the Fixed type. Would there be any objection to doing the same for the Fixed type?

We are not sure whether it should also be guarded so that it takes effect only when the :parser-fn value is :string, or whether there are other caveats to think about.
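To make the request concrete, here is a minimal sketch of the behavior we are asking for. It is plain Clojure and does not reproduce the library's code: missing-value? and parse-fixed-strings are hypothetical names, the :missing marker is a stand-in, and treating only nil, the empty string, and the literal "NA" as missing is an assumption made for illustration.

```clojure
;; Hypothetical predicate, NOT the library's implementation.
;; Assumption: nil, "" and the literal "NA" count as missing by default,
;; and the option only switches off the "NA" check.
(defn missing-value?
  [value {:keys [disable-na-as-missing?]}]
  (or (nil? value)
      (= "" value)
      (and (not disable-na-as-missing?)
           (= "NA" value))))

;; Hypothetical stand-in for a fixed-type string parser: every value is
;; kept as a string unless the predicate above flags it as missing.
(defn parse-fixed-strings
  [values options]
  (mapv (fn [v] (if (missing-value? v options) :missing v))
        values))

;; Column b from the sample CSV, without and with the option.
(parse-fixed-strings ["1" "2" "3" "NA" "5"] {})
;; => ["1" "2" "3" :missing "5"]
(parse-fixed-strings ["1" "2" "3" "NA" "5"] {:disable-na-as-missing? true})
;; => ["1" "2" "3" "NA" "5"]
```

In this sketch the option is honored regardless of the target datatype. Restricting it to the :string case would be one extra condition in the predicate.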
