From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
Subject: | Bug in COPY FROM backslash escaping multi-byte chars |
Date: | 2021-02-03 12:08:37 |
Message-ID: | a897f84f-8dca-8798-3139-07da5bb38728@iki.fi |
Lists: | pgsql-hackers |
Hi,
While playing with COPY FROM refactorings in another thread, I noticed a
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:
\么.foo
I hope that came through in this email correctly as UTF-8. The string
consists of a backslash, a multi-byte character, and a dot.
The documentation says:
> Backslash characters (\) can be used in the COPY data to quote data
> characters that might otherwise be taken as row or column delimiters
So I believe escaping multi-byte characters is supposed to work, and it
usually does.
However, let's consider the same string in Big5 encoding (in hex escaped
format):
\x5ca45c2e666f6f
The first byte, 0x5c, is the backslash. The multi-byte character consists
of two bytes: 0xa4 0x5c. Note that the second byte is equal to a backslash.
That confuses the parser in CopyReadLineText, so that you get an error:
postgres=# create table copytest (t text);
CREATE TABLE
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR: end-of-copy marker corrupt
CONTEXT: COPY copytest, line 1
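To make the byte layout concrete, here is a minimal standalone sketch that counts 0x5c bytes in that input in two ways: looking at each byte in isolation, and stepping one character at a time. The helper names and the length rule (any byte with the high bit set starts a two-byte character) are assumptions made for illustration, not PostgreSQL's actual functions.

```c
#include <assert.h>
#include <stddef.h>

/* The example input in Big5: backslash, 0xa4 0x5c, dot, "foo". */
static const unsigned char big5_input[] = {
    0x5c, 0xa4, 0x5c, 0x2e, 0x66, 0x6f, 0x6f
};

/*
 * Assumption: simplified Big5 length rule. A byte with the high bit set
 * starts a two-byte character; anything else is a single byte.
 */
static size_t
sketch_big5_len(const unsigned char *s)
{
    return (*s & 0x80) ? 2 : 1;
}

/* Count 0x5c bytes, treating every byte as its own character. */
static int
count_backslash_bytes(const unsigned char *buf, size_t len)
{
    int         n = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] == 0x5c)
            n++;
    }
    return n;
}

/* Count 0x5c bytes that are real backslashes, stepping by character. */
static int
count_backslash_chars(const unsigned char *buf, size_t len)
{
    int         n = 0;
    size_t      i = 0;

    assert(buf != NULL);
    while (i < len)
    {
        if (buf[i] == 0x5c)
            n++;
        i += sketch_big5_len(buf + i);  /* jumps over any trail byte */
    }
    return n;
}
```

For the input above, the byte-wise count finds two 0x5c bytes while the character-wise count finds only one, because the second 0x5c is the trail byte of the two-byte character.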
What happens is that when the parser sees the backslash, it looks ahead
at the next byte, and when it's not a dot, it skips over it:
> else if (!cstate->opts.csv_mode)
>
> /*
> * If we are here, it means we found a backslash followed by
> * something other than a period. In non-CSV mode, anything
> * after a backslash is special, so we skip over that second
> * character too. If we didn't do that \\. would be
> * considered an eof-of copy, while in non-CSV mode it is a
> * literal backslash followed by a period. In CSV mode,
> * backslashes are not special, so we want to process the
> * character after the backslash just like a normal character,
> * so we don't increment in those cases.
> */
> raw_buf_ptr++;
However, in a multi-byte encoding that might "embed" ASCII characters,
it should skip over the next *character*, not byte.
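The difference between the two skipping strategies can be sketched in a few lines of standalone C. This is only an illustration, not the attached patch: the function names, the simplified Big5 length rule, and the reduced scan loop are all assumptions, and the real CopyReadLineText does considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The example input in Big5: backslash, 0xa4 0x5c, dot, "foo". */
static const unsigned char big5_input[] = {
    0x5c, 0xa4, 0x5c, 0x2e, 0x66, 0x6f, 0x6f
};

/*
 * Assumption: simplified Big5 length rule. A byte with the high bit set
 * starts a two-byte character; anything else is a single byte.
 */
static size_t
sketch_big5_len(const unsigned char *s)
{
    return (*s & 0x80) ? 2 : 1;
}

/*
 * Hypothetical, reduced version of the non-CSV scan: report whether the
 * scanner believes it has found a backslash followed by a dot.
 *
 * Unescaped characters are always stepped over whole. What varies is the
 * skip after a backslash: one byte (the quoted behavior) or one whole
 * character (the behavior argued for above).
 */
static bool
sees_backslash_dot(const unsigned char *buf, size_t len, bool skip_whole_char)
{
    size_t      i = 0;

    assert(buf != NULL);
    while (i < len)
    {
        if (buf[i] != '\\')
        {
            i += sketch_big5_len(buf + i);
            continue;
        }

        i++;                    /* consume the backslash */
        if (i >= len)
            break;
        if (buf[i] == '.')
            return true;        /* looks like an end-of-copy marker */

        /* Skip the escaped byte or the escaped character. */
        i += skip_whole_char ? sketch_big5_len(buf + i) : 1;
    }
    return false;
}
```

With the byte-wise skip, the scan lands on the trail byte 0x5c, takes it for a backslash, and then sees the dot, so `sees_backslash_dot(big5_input, sizeof(big5_input), false)` returns true. With the character-wise skip, both bytes of the escaped character are consumed together and the same call with `true` returns false.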
Attached is a pretty straightforward patch to fix that. Anyone see a
problem with this?
- Heikki
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-a-corner-case-in-COPY-FROM-backslash-processing.patch | text/x-patch | 2.2 KB |