Should CSV parsing be stricter about mid-field quotes?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Should CSV parsing be stricter about mid-field quotes?
Date: 2023-05-11 14:03:33
Message-ID: 8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I've come across an unexpected behavior in our CSV parser that I'd like to
bring up for discussion.

% cat example.csv
id,rating,review
1,5,"Great product, will buy again."
2,3,"I bought this for my 6" laptop but it didn't fit my 8" tablet"

% psql
CREATE TABLE reviews (id int, rating int, review text);
\COPY reviews FROM example.csv WITH CSV HEADER;
SELECT * FROM reviews;

This gives:

id | rating | review
----+--------+-------------------------------------------------------------
1 | 5 | Great product, will buy again.
2 | 3 | I bought this for my 6 laptop but it didn't fit my 8 tablet
(2 rows)

The parser currently accepts quoting within an unquoted field. This can lead to
data misinterpretation when the quote is part of the field data (e.g.,
for inches, like in the example).

Our CSV output rules quote an entire field or not at all. But the import of
fields with mid-field quotes might lead to surprising and undetected outcomes.

I think we should throw a parsing error for unescaped mid-field quotes,
and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
quotes are necessary. The error message could suggest this option when it
encounters an unescaped mid-field quote.

I think the convenience of not having to use an extra option doesn't outweigh
the risk of undetected data integrity issues.

Thoughts?

/Joel

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-05-11 14:07:23 Re: createuser --memeber and PG 16
Previous Message Tom Lane 2023-05-11 13:56:59 Re: v16 regression - wrong query results with LEFT JOINs + join removal