Re: Should CSV parsing be stricter about mid-field quotes?

From: Greg Stark <stark(at)mit(dot)edu>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2023-05-12 18:58:31
Message-ID: CAM-w4HOEwz13f2aekQAORq+K7KFCO_iVEg3v0NtjZQ1XQgRhXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 11 May 2023 at 10:04, Joel Jacobson <joel(at)compiler(dot)org> wrote:
>
> The parser currently accepts quoting within an unquoted field. This can lead to
> data misinterpretation when the quote is part of the field data (e.g.,
> for inches, like in the example).

I think you're thinking about it differently than the parser. I think
the parser is treating this the way, say, the shell treats quotes.
That is, it sees a quoted "I bought this for my 6" followed by an
unquoted "a laptop but it didn't fit my 8" followed by a quoted "
tablet".

So for example, in that world you might only quote commas and newlines
so you might print something like

1,2,I bought this for my "6"" laptop
" but it "didn't" fit my "8""" laptop

The actual CSV spec https://datatracker.ietf.org/doc/html/rfc4180 only
allows fully quoted or fully unquoted fields and there can only be
escaped double-doublequote characters in quoted fields and no
doublequote characters in unquoted fields.

But it also says

Due to lack of a single specification, there are considerable
differences among implementations. Implementors should "be
conservative in what you do, be liberal in what you accept from
others" (RFC 793 [8]) when processing CSV files. An attempt at a
common definition can be found in Section 2.

So the real question is are there tools out there that generate
entries like this and what are their intentions?

> I think we should throw a parsing error for unescaped mid-field quotes,
> and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
> quotes are necessary. The error message could suggest this option when it
> encounters an unescaped mid-field quote.
>
> I think the convenience of not having to use an extra option doesn't outweigh
> the risk of undetected data integrity issues.

It's also a pretty annoying experience to get a message saying "error,
turn this option on to not get an error". I get what you're saying
too, which is more of a risk depends on whether turning off the error
is really the right thing most of the time or is just causing data to
be read incorrectly.

--
greg

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-05-12 19:08:47 Re: psql tests hangs
Previous Message Nathaniel Sabanski 2023-05-12 18:58:24 Re: Adding SHOW CREATE TABLE