Re: Should CSV parsing be stricter about mid-field quotes?

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2023-05-14 14:58:37
Message-ID: e06241ae-3b49-b3fd-1760-4ecf55cb6429@dunslane.net
Lists: pgsql-hackers


On 2023-05-13 Sa 23:11, Greg Stark wrote:
> On Sat, 13 May 2023 at 09:46, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>> I could see an argument for a STRICT mode which would disallow partially
>>> quoted fields, although I'd like some evidence that we're dealing with a
>>> real problem here. Is there really a CSV producer that produces output
>>> like that you showed in your example? And if so has anyone objected to
>>> them about the insanity of that?
>> I think you'd want not just "some evidence" but "compelling evidence".
>> Any such option is going to add cycles into the low-level input parser
>> for COPY, which we know is a hot spot and we've expended plenty of
>> sweat on. Adding a speed penalty that will be paid by the 99.99%
>> of users who don't have an issue here is going to be a hard sell.
> Well I'm not sure that follows. Joel specifically claimed that an
> implementation that didn't accept inputs like this would actually be
> simpler and that might mean it would actually be faster.
>
> And I don't think you have to look very hard for inputs like this --
> plenty of people generate CSV files from simple templates or script
> outputs that don't understand escaping quotation marks at all. Outputs
> like that will be fine as long as there's no doublequotes in the
> inputs but then one day someone will enter a doublequote in a form
> somewhere and blammo.

The procedure described is plain wrong, and I don't have too much
sympathy for people who implement it. Parsing CSV files might be a mild
PITN, but constructing them is pretty darn simple. Something like this
perl fragment should do it:

do {
  s/"/""/g;                      # double any embedded quotes
  $_ = qq{"$_"} if /[",\r\n]/;   # quote the field if it contains a
                                 # quote, comma, CR or LF
} foreach @fields;
print join(',', @fields), "\n";
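(For what it's worth, the standard library csv module in Python does the same quoting dance; this sketch just mirrors the Perl fragment above, it isn't part of the original thread:)

```python
import csv
import io

# QUOTE_MINIMAL (the default) quotes only fields containing the
# delimiter, the quote character, or a line ending, and doubles any
# embedded quotes -- the same rules as the Perl fragment above.
buf = io.StringIO()
csv.writer(buf, lineterminator='\n').writerow(
    ['plain', 'has "quotes"', 'has,comma'])
print(buf.getvalue(), end='')  # plain,"has ""quotes""","has,comma"
```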

And if people do follow the method you describe then their input with
unescaped quotes will be rejected 999 times out of 1000. It's only in
cases where the field happens to contain an even number of embedded
quotes, as in Joel's somewhat contrived example, that the input will be
accepted.
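(As an aside, and not a claim about what COPY does or should do: some other parsers make exactly this strictness a per-dialect switch. Python's csv reader, for instance, splices stray text after a closing quote back into the field by default, but rejects the same input when strict is set:)

```python
import csv

row = '"ab"cd,x'  # a stray closing quote in mid-field

# Lenient (default): the text after the closing quote is
# spliced back into the field.
print(next(csv.reader([row])))  # ['abcd', 'x']

# Strict: the same input is rejected outright.
try:
    next(csv.reader([row], strict=True))
except csv.Error as e:
    print('rejected:', e)
```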

> So I guess the real question is whether accepting inputs with
> unescaped quotes and interpreting them the way we do is really the
> best interpretation. Is the user best served by a) assuming they
> intended to quote part of the field and not quote part of it, b)
> assuming they failed to escape the quotation mark, or c) assuming
> something's gone wrong and the input is entirely untrustworthy?
>

As I said earlier, I'm quite reluctant to break things that might have
been working happily for people for many years, in order to accommodate
people who can't do the minimum required to produce correct CSVs. I have
no idea how many are relying on it, but I would be slightly surprised if
the number were zero.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
