Re: Bug in COPY FROM backslash escaping multi-byte chars

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: john(dot)naylor(at)enterprisedb(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in COPY FROM backslash escaping multi-byte chars
Date: 2021-02-04 19:37:54
Message-ID: c78113ac-8c25-5892-1269-d174d11e3a9f@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04/02/2021 03:50, Kyotaro Horiguchi wrote:
> At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in
>> On 03/02/2021 15:38, John Naylor wrote:
>>> On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi
>>> <mailto:hlinnaka(at)iki(dot)fi>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > While playing with COPY FROM refactorings in another thread, I noticed
>>> > corner case where I think backslash escaping doesn't work correctly.
>>> > Consider the following input:
>>> >
>>> > \么.foo
>>> I've seen multibyte delimiters in the wild, so it's not as outlandish
>>> as it seems.
>>
>> We don't actually support multi-byte characters as delimiters or quote
>> or escape characters:
>>
>> postgres=# copy copytest from 'foo' with (delimiter '么');
>> ERROR: COPY delimiter must be a single one-byte character
>>
>>> The fix is simple enough, so +1.
>>
>> Thanks, I'll commit and backpatch shortly.
>
> I'm not sure the assumption in the second hunk always holds, but
> that's fine at least with Shift-JIS and -2004 since they are two-byte
> encoding.

The assumption is that a multi-byte character cannot have a special
meaning, as far as the loop in CopyReadLineText is concerned. The
characters with special meaning are '\\', '\n' and '\r'. That hold
regardless of encoding.

Thinking about this a bit more, I think the attached patch is slightly
better. Normally in the loop, raw_buf_ptr points to the next byte to
consume, and 'c' is the last consumed byte. At the end of the loop, we
check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd
and 4th byte if necessary. The crux of the bug is that after the
"raw_buf_ptr++;" to skip the character after the backslash, we left c to
'\\', even though we already consumed the first byte of the next
character. Because of that, the end-of-the-loop check didn't correctly
treat it as a multi-byte character. So a more straightforward fix is to
set 'c' to the byte we skipped over.

- Heikki

Attachment Content-Type Size
v2-0001-Fix-a-corner-case-in-COPY-FROM-backslash-processi.patch text/x-patch 2.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-02-04 19:54:59 Re: logical replication worker accesses catalogs in error context callback
Previous Message Bruce Momjian 2021-02-04 18:49:32 Re: Multiple full page writes in a single checkpoint?