Almost bug in COPY FROM processing of GB18030 encoded input

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Almost bug in COPY FROM processing of GB18030 encoded input
Date:	2019-01-23 11:23:23
Message-ID:	7704d099-9643-2a55-fb0e-becd64400dcb@iki.fi
Lists:	pgsql-hackers

Hi,

I happened to notice that when CopyReadLineText() calls mblen(), it
passes only the first byte of each multi-byte character. However,
pg_gb18030_mblen() looks at both the first and the second byte.
CopyReadLineText() always passes \0 as the second byte, so
pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
characters as 2.

It works out fine, though, because in GB18030 the second half of a
4-byte encoded character always looks like another 2-byte encoded
character. CopyReadLineText() is looking for delimiter and escape
characters and newlines, and only single-byte characters are supported
for those, so treating a 4-byte character as two 2-byte characters is
harmless.
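
To make the argument above concrete, here is a minimal standalone sketch.
It is not the PostgreSQL source: sketch_gb18030_mblen() only mimics the
"look at the second byte" rule described above, and the other function
names are hypothetical stand-ins for a caller that hands over just the
first byte followed by \0. The scan assumes the target is a single-byte
(ASCII) character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the length rule described above (an assumption modeled on the
 * description, not copied from the server): ASCII is 1 byte; otherwise a
 * second byte in 0x30..0x39 means a 4-byte character, anything else 2 bytes.
 */
static int
sketch_gb18030_mblen(const unsigned char *s)
{
	if ((s[0] & 0x80) == 0)
		return 1;
	if (s[1] >= 0x30 && s[1] <= 0x39)
		return 4;
	return 2;
}

/* Hypothetical caller behaviour: pass the first byte only, then \0. */
static int
sketch_mblen_first_byte_only(unsigned char first)
{
	unsigned char buf[2];

	buf[0] = first;
	buf[1] = '\0';
	return sketch_gb18030_mblen(buf);
}

/*
 * Scan for a single-byte target, stepping by the (possibly under-reported)
 * character length. Returns the byte offset of the match, or -1.
 */
static long
sketch_find_single_byte(const unsigned char *buf, size_t len,
						unsigned char target)
{
	size_t		pos = 0;

	while (pos < len)
	{
		if (buf[pos] == target)
			return (long) pos;
		pos += (size_t) sketch_mblen_first_byte_only(buf[pos]);
	}
	return -1;
}

/* Small self-check: one 4-byte character followed by an ASCII '5'. */
static void
sketch_self_check(void)
{
	const unsigned char input[] = {0x81, 0x35, 0x81, 0x35, '5'};

	assert(sketch_gb18030_mblen(input) == 4);			/* full view */
	assert(sketch_mblen_first_byte_only(input[0]) == 2);	/* truncated view */
	assert(sketch_find_single_byte(input, sizeof(input), '5') == 4);
}
```

In the self-check, the 4-byte character is walked as two 2-byte steps, so
the 0x35 bytes inside it are never compared against the target and the
match lands on the real '5' at offset 4.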

Attached is a patch to explain that in the comments. Grepping for
mblen(), I didn't find any other callers that used mblen() like that.

- Heikki

Attachment	Content-Type	Size
0001-Fix-comments-to-that-claimed-that-mblen-only-looks-a.patch	text/x-patch	4.0 KB

Responses

Re: Almost bug in COPY FROM processing of GB18030 encoded input at 2019-01-24 21:27:11 from Robert Haas