Bug in COPY FROM backslash escaping multi-byte chars

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Subject: Bug in COPY FROM backslash escaping multi-byte chars
Date: 2021-02-03 12:08:37
Message-ID: a897f84f-8dca-8798-3139-07da5bb38728@iki.fi
Lists: pgsql-hackers

Hi,

While playing with COPY FROM refactorings in another thread, I noticed a
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:

\么.foo

I hope that came through in this email correctly as UTF-8. The string
contains a sequence of: a backslash, a multi-byte character, and a dot.

The documentation says:

> Backslash characters (\) can be used in the COPY data to quote data
> characters that might otherwise be taken as row or column delimiters

So I believe escaping multi-byte characters is supposed to work, and it
usually does.

However, let's consider the same string in Big5 encoding (shown in
hex-escaped format):

\x5ca45c2e666f6f

The first byte, 0x5c, is the backslash. The multi-byte character consists
of two bytes: 0xa4 0x5c. Note that its second byte has the same value as
a backslash.
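
To make the ambiguity concrete, here is a minimal standalone sketch (not
PostgreSQL code). It assumes a simplified Big5 rule: a byte with the high
bit set starts a two-byte character, and everything else is a single
byte. The names sketch_big5_charlen, count_backslash_bytes,
count_backslash_chars and sample are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code.  Assumption: a byte with the
 * high bit set starts a two-byte Big5 character, anything else is a
 * one-byte character.
 */
static size_t
sketch_big5_charlen(unsigned char lead)
{
    return (lead & 0x80) ? 2 : 1;
}

/* Count every 0x5c byte, ignoring character boundaries. */
static int
count_backslash_bytes(const unsigned char *buf, size_t len)
{
    int         n = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0x5c)
            n++;
    return n;
}

/* Count only 0x5c bytes that are complete one-byte characters. */
static int
count_backslash_chars(const unsigned char *buf, size_t len)
{
    int         n = 0;

    assert(buf != NULL);
    /* Advance by whole characters, so trailing bytes are never inspected. */
    for (size_t i = 0; i < len; i += sketch_big5_charlen(buf[i]))
        if (buf[i] == 0x5c)
            n++;
    return n;
}

/* The sample line from above: backslash, 0xa4 0x5c, then ".foo". */
static const unsigned char sample[] = {0x5c, 0xa4, 0x5c, 0x2e, 0x66, 0x6f, 0x6f};
```

A byte-wise scan of sample finds two 0x5c bytes, while the
character-aware scan finds only the one real backslash.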

That confuses the parser in CopyReadLineText, so that you get an error:

postgres=# create table copytest (t text);
CREATE TABLE
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR: end-of-copy marker corrupt
CONTEXT: COPY copytest, line 1

What happens is that when the parser sees the backslash, it looks ahead
at the next byte, and when it's not a dot, it skips over it:

> else if (!cstate->opts.csv_mode)
>
> /*
> * If we are here, it means we found a backslash followed by
> * something other than a period. In non-CSV mode, anything
> * after a backslash is special, so we skip over that second
> * character too. If we didn't do that \\. would be
> * considered an eof-of copy, while in non-CSV mode it is a
> * literal backslash followed by a period. In CSV mode,
> * backslashes are not special, so we want to process the
> * character after the backslash just like a normal character,
> * so we don't increment in those cases.
> */
> raw_buf_ptr++;

However, in a multi-byte encoding whose trailing bytes can "embed" ASCII
values, it should skip over the next *character*, not the next byte.
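
The difference between the two skipping strategies can be sketched as
follows. This is only an illustration and not the attached patch: the
helper names are hypothetical, and the Big5 length rule (a high-bit byte
starts a two-byte character) is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical length rule (assumption): a high-bit lead byte means 2 bytes. */
static size_t
assumed_big5_charlen(unsigned char lead)
{
    return (lead & 0x80) ? 2 : 1;
}

/*
 * Offset at which scanning resumes if the escaped item after the
 * backslash is treated as a single byte.
 */
static size_t
skip_escape_bytewise(const unsigned char *buf, size_t len, size_t backslash_pos)
{
    assert(buf != NULL);
    assert(backslash_pos + 1 < len);
    assert(buf[backslash_pos] == 0x5c);
    return backslash_pos + 2;
}

/*
 * Offset at which scanning resumes if the escaped item after the
 * backslash is treated as a whole character.
 */
static size_t
skip_escape_charwise(const unsigned char *buf, size_t len, size_t backslash_pos)
{
    size_t      next;

    assert(buf != NULL);
    assert(backslash_pos + 1 < len);
    assert(buf[backslash_pos] == 0x5c);
    next = backslash_pos + 1 + assumed_big5_charlen(buf[backslash_pos + 1]);
    /* Clamp in case the input ends in the middle of a character. */
    return next < len ? next : len;
}

/* Big5 bytes of the example line: backslash, 0xa4 0x5c, ".foo". */
static const unsigned char big5_line[] = {0x5c, 0xa4, 0x5c, 0x2e, 0x66, 0x6f, 0x6f};
```

With big5_line and the backslash at offset 0, the byte-wise version
resumes at offset 2, where the bytes 0x5c 0x2e look like "\.", the start
of an end-of-copy marker. The character-wise version resumes at offset 3,
on the dot itself.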

Attached is a pretty straightforward patch to fix that. Anyone see a
problem with this?

- Heikki

Attachment Content-Type Size
0001-Fix-a-corner-case-in-COPY-FROM-backslash-processing.patch text/x-patch 2.2 KB
