Re: backslash-dot quoting in COPY CSV

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Michael Paquier" <michael(at)paquier(dot)xyz>,pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: backslash-dot quoting in COPY CSV
Date: 2019-01-30 17:32:11
Message-ID: 4a6bb1d2-8cc9-4e52-9060-6e7c1a24579b@manitou-mail.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Bruce Momjian wrote:

> > - the end of data could be expressed as a length (in number of lines
> > for instance) instead of an in-data marker.
> >
> > - the end of data could be configurable, as in the MIME structure of
> > multipart mail messages, where a part is ended by a "boundary",
> > line, generally a long randomly generated string. This boundary
> > would have to be known to psql through setting a dedicated
> > variable or command.
> >
> > - COPY as the SQL command could have the boundary option
> > for data fed through its STDIN. This could neutralize the
> > special role of backslash-dot in general, not just in quoted fields,
> > since the necessity to quote backslash-dot is a wart anyway.
>
> Well, these all kind of require a change to the COPY format, which
> hasn't changed in many years.

Not for the first two. As an example of solution #2, it could look like this:

=# \set INLINE_COPY_BOUNDARY ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==
=# COPY table FROM STDIN CSV;
somevalue,"foo
\.
bar"
==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==

Instead of looking for \. on a line by itself, psql would look for the
boundary to know where the data ends.
The boundary is not transmitted to the server, it has no need to know
about it.

> > - psql could be told somehow that the next piece of inline data is in
> > the CSV format, and then pass it through a CSV parser.
>
> That might be the cleanest solution, but how would we actually input
> multi-line data in CSV mode with \. alone on a line?

With this solution, the content doesn't change at all.
The weird part would be the user interface, because the information
psql needs is not only "CSV", it's also the options DELIMITER, QUOTE,
ESCAPE and possibly ENCODING. Currently it doesn't know any of these,
they're passed to the server in an opaque, unparsed form within
the COPY command.

Personally, the solution I find cleaner is the server side not having
any end-of-data marker for CSV. So backslash-dot would never be
special. psql could allow for a custom ending boundary for in-script
data, and users could set that to backslash-dot if they want, but that
would be their choice.
That would be clearly not backward compatible, and I believe it wouldn't
work with the v2 protocol, so I'm not sure it would have much chance of
approval.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bossart, Nathan 2019-01-30 17:45:58 Re: A few new options for vacuumdb
Previous Message Amit Langote 2019-01-30 17:28:38 Re: speeding up planning with partitions