Re: Should CSV parsing be stricter about mid-field quotes?

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Kirk Wolak <wolakk(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2023-05-18 06:35:26
Message-ID: CAFj8pRBPPfmL+xhBmZha+OAyJO2zXj+28RFPJdd2wS2+pfZc_Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 18, 2023 at 8:01, Joel Jacobson <joel(at)compiler(dot)org> wrote:

> On Thu, May 18, 2023, at 00:18, Kirk Wolak wrote:
> > Here you go. Not horrible handling. (I use DataGrip so I saved it from
> > there directly as TSV, just for an extra datapoint).
> >
> > FWIW, if you copy/paste in Windows, the data, the field with the tab gets
> > split into another column in Excel. But saving it as a file, and opening
> > it. Saving it as XLSX, and then having Excel save it as a TSV (versus
> > opening a text file, and saving it back)
>
> Very useful, thanks.
>
> Interesting, DataGrip, contrary to Excel, doesn't quote fields with commas
> in TSV. All the DataGrip/Excel TSV variants use quoting when necessary,
> contrary to Google Sheets' TSV format, which doesn't quote fields at all.
>

Maybe there is yet another, third implementation in LibreOffice.

Generally, TSV is not well specified, so the implementations are not
consistent.

>
> DataGrip/Excel also terminate the last record with a newline,
> while Google Sheets omits the newline for the last record
> (which is bad, since a streaming reader then wouldn't know
> whether the last record is complete or not).
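
To make the streaming problem concrete, here is a minimal Python sketch.
The helper name `feed` is hypothetical, and the sketch assumes records are
separated by a bare "\n" with no quoted embedded newlines. It is only an
illustration, not how COPY reads its input.

```python
def feed(buffer: str, chunk: str) -> tuple[list[str], str]:
    """Append chunk to buffer; return (complete records, unterminated rest).

    Hypothetical helper: a record only counts as complete once its
    terminating newline has been seen.
    """
    data = buffer + chunk
    # Everything before the final "\n" is complete; the tail is not (yet).
    *records, remainder = data.split("\n")
    return records, remainder


# DataGrip/Excel style: the last record is newline-terminated.
terminated_records, terminated_rest = feed("", "a\tb\nc\td\n")
# terminated_records == ["a\tb", "c\td"], terminated_rest == ""

# Google Sheets style: no newline after the last record.
open_records, open_rest = feed("", "a\tb\nc\td")
# open_records == ["a\tb"], open_rest == "c\td"
# The reader cannot tell whether "c\td" is a whole record or a partial one
# until the stream is closed.
```

With a terminating newline the reader can hand off every record as soon as
it arrives; without it, the last record is only known to be complete at
end of input.
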
>
> This makes me think we probably shouldn't add a new TSV format,
> since there is no consistency between vendors.
> It's impossible to deduce with certainty if a TSV-field that
> begins with a double quotation mark is quoted or unquoted.
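
A small Python sketch of that ambiguity, using the standard `csv` module as
a stand-in for the two vendor conventions (this is an assumption for
illustration only; it says nothing about how COPY itself parses). The
sample line and the `parse` helper are made up.

```python
import csv
import io

# One physical line; the first field begins with a double quote.
line = '"a\tb"\tc\n'


def parse(text, quoting):
    """Parse a single tab-separated line under the given quoting rule."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return next(reader)


# Convention 1 (quote when necessary): the quotes wrap a field that
# contains a tab.
quoted_reading = parse(line, csv.QUOTE_MINIMAL)
# quoted_reading == ['a\tb', 'c']

# Convention 2 (never quote): the quotes are ordinary data characters.
literal_reading = parse(line, csv.QUOTE_NONE)
# literal_reading == ['"a', 'b"', 'c']
```

The same bytes yield two fields under one convention and three under the
other, and nothing in the file itself says which reading was intended.
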
>
> Two alternative ideas:
>
> 1. How about adding a `WITHOUT QUOTE` or `QUOTE NONE` option in conjunction
> with `COPY ... WITH CSV`?
>
> Internally, it would just set
>
>     quotec = '\0';
>
> so it wouldn't affect performance at all.
>
> 2. How about adding a note on the complexities of dealing with TSV files
> in the COPY documentation?
>
> /Joel
>
>
