emp::DataFile output should comply with CSV standard RFC 4180 #489

@mmore500

Is your feature request related to a problem? Please describe.

Serialization through emp::DataFile and deserialization through emp::File both default to CSV, but out of the box they support only a subset of the format.

For example, this file

"a","b","c,d"
"""g""",x,y

should be read as

a b c,d
"g" x y

according to RFC 4180. However, it would currently read as

"a" "b" "c d"
"""g""" x y

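Note that in the RFC 4180-compliant reading, the outermost quotes of """g""" enclose a single field and each doubled quote inside is an escaped literal quote, making the actual value "g".
Likewise, the comma inside "c,d" is part of the field rather than a separator.
In the current reading, the quotes are interpreted literally, so the field reads as """g""", and "c,d" is split at the comma.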
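
To make the expected reading concrete, here is a minimal sketch of RFC 4180 field splitting for one record. `ParseRfc4180Record` is a hypothetical name, not an existing Empirical function, and the sketch assumes the whole record (including any line breaks embedded in quoted fields) is already in one string. It is lenient about quotes that appear in the middle of an unquoted field and simply keeps them.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Empirical): split one CSV record into
// fields using RFC 4180 quoting rules.
std::vector<std::string> ParseRfc4180Record(const std::string& line) {
  std::vector<std::string> fields;
  std::string field;
  bool in_quotes = false;
  for (std::size_t i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (in_quotes) {
      if (c != '"') {
        field += c;              // ordinary character; commas are kept
      } else if (i + 1 < line.size() && line[i + 1] == '"') {
        field += '"';            // "" inside quotes is one literal quote
        ++i;                     // skip the second quote of the pair
      } else {
        in_quotes = false;       // closing quote
      }
    } else if (c == '"' && field.empty()) {
      in_quotes = true;          // opening quote at the start of a field
    } else if (c == ',') {
      fields.push_back(field);   // separator outside quotes ends the field
      field.clear();
    } else {
      field += c;                // unquoted character (lenient about stray quotes)
    }
  }
  if (in_quotes) throw std::runtime_error("unterminated quoted field");
  fields.push_back(field);       // last field has no trailing comma
  return fields;
}
```

Under these assumptions, passing the two example lines above should yield the fields a, b, c,d and "g", x, y respectively.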

Describe the solution you'd like

For performance reasons, the default behavior of emp::DataFile and emp::File should probably not change.
However, RFC 4180 modes or classes should be available.

In debug mode, emp::DataFile/emp::File should probably warn of RFC 4180 noncompliance where pertinent.
An easy way to do this would be to compare results against the RFC 4180-enabled interpretation and warn when the naive interpretation differs.

Describe alternatives you've considered

Users could currently get part of the way for serialization by setting the beginning, separator, and end delimiters to ", ",", and ", respectively.
This delimiter kludge wouldn't work as a deserialization solution because it would fail on plain CSV files like

a,b,c
1,2,3

For serialization, this delimiter kludge would add unnecessary quotes to much of the CSV output, and it would not escape " characters inside output strings as "".
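
By contrast, an RFC 4180-style writer would quote a field only when needed and double any embedded quotes. The following is a minimal sketch of that per-field step; `ToRfc4180Field` is a hypothetical name and not an existing emp::DataFile method.

```cpp
#include <string>

// Hypothetical helper (not part of Empirical): format one value as an
// RFC 4180 field.
std::string ToRfc4180Field(const std::string& value) {
  // Values without commas, double quotes, or line breaks are written as-is.
  if (value.find_first_of(",\"\r\n") == std::string::npos) return value;

  std::string out = "\"";      // opening quote
  for (char c : value) {
    if (c == '"') out += '"';  // escape a quote by preceding it with another
    out += c;
  }
  out += '"';                  // closing quote
  return out;
}
```

With this sketch, a plain value such as abc passes through unchanged, c,d becomes "c,d", and "g" becomes """g""".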

Additional context
See RFC 4180.
The pertinent content, from Section 2, is:

5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
"aaa","b""bb","ccc"
