From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Subject: Bug in COPY FROM backslash escaping multi-byte chars
Date: 2021-02-03 12:08:37
Message-ID: a897f84f-8dca-8798-3139-07da5bb38728@iki.fi
Lists: pgsql-hackers


While playing with COPY FROM refactorings in another thread, I noticed a
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:


I hope that came through in this email correctly as UTF-8. The string
contains a sequence of: backslash, multibyte-character and a dot.

The documentation says:

> Backslash characters (\) can be used in the COPY data to quote data
> characters that might otherwise be taken as row or column delimiters

So I believe escaping multi-byte characters is supposed to work, and it
usually does.

However, let's consider the same string in Big5 encoding (in hex escaped form):


The first byte, 0x5c, is the backslash. The multi-byte character consists
of two bytes: 0xa4 0x5c. Note that its second byte is equal to a backslash.

That confuses the parser in CopyReadLineText, so that you get an error:

postgres=# create table copytest (t text);
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR: end-of-copy marker corrupt
CONTEXT: COPY copytest, line 1

What happens is that when the parser sees the backslash, it looks ahead
at the next byte, and when it's not a dot, it skips over it:

> else if (!cstate->opts.csv_mode)
> /*
> * If we are here, it means we found a backslash followed by
> * something other than a period. In non-CSV mode, anything
> * after a backslash is special, so we skip over that second
> * character too. If we didn't do that \\. would be
> * considered an eof-of copy, while in non-CSV mode it is a
> * literal backslash followed by a period. In CSV mode,
> * backslashes are not special, so we want to process the
> * character after the backslash just like a normal character,
> * so we don't increment in those cases.
> */
> raw_buf_ptr++;

However, in a multi-byte encoding that might "embed" ASCII characters,
it should skip over the next *character*, not just the next byte.
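
To make the difference concrete, here is a minimal standalone sketch. It is
not the CopyReadLineText code and not the attached patch. The names
big5_char_len and find_backslash_dot are made up for illustration, and the
length rule (a byte with the high bit set starts a two-byte character) is a
simplifying assumption about Big5. The scanner looks for a backslash-dot
pair and, after any other backslash, skips either one byte or one whole
character:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified length function: assume that in Big5 a byte
 * with the high bit set starts a two-byte character.
 */
static size_t
big5_char_len(unsigned char lead)
{
    return (lead & 0x80) ? 2 : 1;
}

/*
 * Scan buf for a backslash immediately followed by a dot. Returns the
 * offset of that backslash, or -1 if there is none.
 *
 * skip_whole_char == 0: after a backslash, skip exactly one byte.
 * skip_whole_char != 0: after a backslash, skip one whole character.
 */
static long
find_backslash_dot(const unsigned char *buf, size_t len, int skip_whole_char)
{
    size_t      pos = 0;

    assert(buf != NULL);

    while (pos < len)
    {
        unsigned char c = buf[pos];

        if (c != '\\')
        {
            /* Ordinary character: step over all of its bytes. */
            pos += big5_char_len(c);
            continue;
        }

        /* Backslash at the end of the buffer: nothing to look ahead at. */
        if (pos + 1 >= len)
            break;

        if (buf[pos + 1] == '.')
            return (long) pos;

        /* Skip the backslash plus one byte, or plus one whole character. */
        pos += 1 + (skip_whole_char ? big5_char_len(buf[pos + 1]) : 1);
    }

    return -1;
}
```

With the four bytes 0x5c 0xa4 0x5c 0x2e, the byte-wise variant lands on the
trailing 0x5c of the multi-byte character and reports a backslash-dot pair
at offset 2. The character-wise variant steps over both 0xa4 and 0x5c and
finds no such pair.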

Attached is a pretty straightforward patch to fix that. Anyone see a
problem with this?

- Heikki

