Re: Should CSV parsing be stricter about mid-field quotes?

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Greg Stark" <stark(at)mit(dot)edu>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Should CSV parsing be stricter about mid-field quotes?
Date: 2023-05-16 11:43:28
Message-ID: 0f18887d-5113-4584-a295-c6bd0fe313b3@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, May 14, 2023, at 16:58, Andrew Dunstan wrote:
> And if people do follow the method you describe then their input with
> unescaped quotes will be rejected 999 times out of 1000. It's only cases where
> the field happens to have an even number of embedded quotes, like Joel's
> somewhat contrived example, that the input will be accepted.

I concur with Andrew that my previous example might've been somewhat
contrived, as it deliberately included two instances of the term "inches".
It's a matter of time before someone submits a review featuring an odd number of
"inches", leading to an error.

Having done some additional digging, I stumbled upon three instances [1] [2] [3]
where users have misidentified their TSV/TEXT files as CSV. In these situations,
users have been shielded from failure due to an imbalance in quotes.

However, in cases where a field is utilized to store text in which the double
quotation mark is rarely used to denote inches, but instead, for quotation
and/or HTML attributes, it's quite feasible that a large amount of
user-generated text could contain balanced quotes. Even more concerning is the
potential for cases where the vast majority of inputs may not even contain
double quotation marks at all. This would effectively render the issue invisible,
even upon manual data inspection.

Here's a problem scenario that I believe is plausible:

1. The user wishes to import a .TXT file into PostgreSQL.

2. The user examines the .TXT file and observes column headers separated by a
delimiter like TAB or semicolon, with subsequent rows of data also separated by
the same delimiter.

3. The user is familiar with "CSV" (412M hits on Google) but not "TSV" (48M hits
on Google), leading to a false assumption that their file is in CSV format.

4. A Google search for "import csv into postgresql" leads the user to a tutorial
titled "Import CSV File Into PostgreSQL Table". An example found therein:

COPY persons(first_name, last_name, dob, email)
FROM 'C:\sampledb\persons.csv'
DELIMITER ','
CSV HEADER;

5. The user, now confident, believes they understand how to import their "CSV"
file.

6. In contrast to the "ERROR: unterminated CSV quoted field" examples below,
this user's .TXT file contains fields with balanced midfield quote-marks:

blog_posts.txt:
id message
1 This is a <b>bold</b> statement

7. The user copies the COPY command from the tutorial and modifies the file path
and delimiter accordingly. The user then concludes that the code is functioning
as expected and proceeds to deploy it.

8, Weeks later, users complain about broken links in their blog posts. Upon
inspection of the blog_posts table, the user identifies an issue:

SELECT * FROM blog_posts;
id | message
----+------------------------------------------------------------------------
1 | This is a <b>bold</b> statement
2 | Check <a href=http://example.com/?param1=Midfield quoting>this</a> out
(2 rows)

One of the users has used balanced quotes for the href attribute, which was
imported successfully but the quotes were stripped, contrary to the intention of
preserving them.

Content of blog_posts.txt:
id message
1 This is a <b>bold</b> statement
2 Check <a href="http://example.com/?param1=Midfield quoting">this</a> out

If we made midfield quoting a CSV error, those users who are currently mistaken
about their TSV/TEXT files being CSV while also having balanced quotes in their
data, would encounter an error rather than a silent failure, which I believe
would be an enhancement.

/Joel

[1] https://www.postgresql.org/message-id/1upfg19cru2jigbm553fugj5k6iebtd4ps@4ax.com
[2] https://stackoverflow.com/questions/44108286/unterminated-csv-quoted-field-in-postgres
[3] https://dba.stackexchange.com/questions/306662/unterminated-csv-quoted-field-when-to-import-csv-data-file-into-postgresql

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-05-16 12:16:24 Re: cutting down the TODO list thread
Previous Message Tomas Vondra 2023-05-16 10:01:51 Re: Memory leak from ExecutorState context?