Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker corrupt

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Ben Gould <ben(dot)gould(at)free(dot)fr>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker corrupt
Date: 2005-12-27 03:24:40
Message-ID: 200512270324.jBR3OeK11101@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-patches


Sorry for the delay in responding. I have done research on your bug
report, and the problem seems even worse than you reported. First, a
little background. In non-CSV text mode, the backslash is the escape
character, so any character appearing after a backslash (all 255 of
them) is treated specially, e.g. \{delimiter}, \r, \n, and for our case
here '\.'. (A literal backslash is \\). In CSV mode, the quote is
special, but we don't have 255 special characters after a quote. Only
two double-quotes, "", are special, a literal double-quote.

This behavior gives us problems for specifying the end-of-copy marker,
which is \, in both modes. The big problem is that \. is also a valid
CSV data value (though not a valid non-CSV data value). So, the
solution we came up with was to require \. to appear alone on a line in
CSV mode for it to be treated as end-of-copy. Your idea of using quotes
worked, but it wasn't the right solution. We need to enforce the
alone-on-a-line restriction. Our code had:

if (c == '\\' && cstate->line_buf.len == 0)

The problem with that is the because of the input and _output_
buffering, cstate->line_buf.len could be zero even if we are not on the
first character of a line. In fact, for a typical line, it is zero for
all characters on the line. The proper solution is to introduce a
boolean, first_char_in_line, that we set as we enter the loop and clear
once we process a character.

Looking closer at the code, I see the reason for email comments like
"the copy code is nearing unmaintainability. The CSV/non-CSV code was
already complex, but the buffering additions in 8.1 pushed it over the
edge.

I have restructured the line-reading code in copy.c by:

o merging the CSV/non-CSV functions into a single function
o used macros to centralize and clarify the buffering code
o updated comments
o renamed client_encoding_only to encoding_embeds_ascii
o added a high-bit test to the encoding_embeds_ascii test for
performance
o in CSV mode, allow a backslash followed by a non-period to
continue being processed as a data value

There should be no performance impact from this patch because it is
functionally equivalent. If you apply the patch you will see copy.c is
much clearer in this area now and might suggest additional
optimizations.

I have also attached a 8.1-only patch to fix the CSV \. handling bug
with no code restructuring.

---------------------------------------------------------------------------

Ben Gould wrote:
>
> The following bug has been logged online:
>
> Bug reference: 2114
> Logged by: Ben Gould
> Email address: ben(dot)gould(at)free(dot)fr
> PostgreSQL version: 8.1.0
> Operating system: Mac OS X 10.4.3
> Description: (patch) COPY FROM ... end of copy marker corrupt
> Details:
>
> With a table like:
>
> CREATE TABLE test_table (
> foo text,
> bar text,
> baz text
> );
>
> Using this format for COPY FROM:
>
> COPY test_table FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL'
> QUOTE AS '\"' ESCAPE AS '\"'
>
> Where the file was generated via:
>
> COPY test_table TO STDOUT WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL'
> QUOTE AS '\"' ESCAPE AS '\"' FORCE QUOTE foo, bar, baz;
>
> I needed this patch:
>
> <<<
> --- postgresql-8.1.0.original/src/backend/commands/copy.c 2005-12-13
> 13:18:16.000000000 +0100
> +++ postgresql-8.1.0/src/backend/commands/copy.c 2005-12-13
> 13:28:28.000000000 +0100
> @@ -2531,7 +2531,7 @@
> /*
> * In CSV mode, we only recognize \. at start of line
> */
> - if (c == '\\' && cstate->line_buf.len == 0)
> + if (c == '\\' && !in_quote && cstate->line_buf.len == 0)
> {
> char c2;
> >>>
>
> Because of this error message:
>
> pg_endcopy warning: ERROR: end-of-copy marker corrupt
>
> (We have quoted strings containing things like ..\..\.. in the CSV file
> which broke the copy from.)
>
> I was using DBD::Pg as the client library.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Attachment Content-Type Size
unknown_filename text/plain 35.9 KB
unknown_filename text/plain 4.1 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Guerra Antonio 2005-12-27 08:28:45 BUG #2133: can't reinstall postgresql
Previous Message kenichi nakanishi 2005-12-26 15:47:36 BUG #2131: SQL Query Bug ?

Browse pgsql-patches by date

  From Date Subject
Next Message Pavel Stehule 2005-12-27 08:29:35 list of scalars for fors and fore stms II
Previous Message Andrew Dunstan 2005-12-26 22:49:29 Re: [PATCHES] default resource limits