tsearch is non-multibyte-aware in a few places

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Giorgio Valoti <giorgio_v(at)mac(dot)com>
Subject: tsearch is non-multibyte-aware in a few places
Date: 2008-06-19 16:29:11
Message-ID: 15580.1213892951@sss.pgh.pa.us
Lists: pgsql-hackers

I've identified the cause of bug #4253:

    /* Trim trailing space */
    while (*pbuf && !t_isspace(pbuf))
        pbuf++;
    *pbuf = '\0';

At least on Macs, t_isspace is capable of returning "true" when pointed
at the second byte of a 2-byte UTF8 character. This explains the report
that the letter "" has a problem when some other ones don't. Of
course pbuf needs to be incremented using pg_mblen not just ++.
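
A minimal standalone sketch of that fix, stepping by whole characters instead of single bytes. It does not use the real backend functions: `sketch_mblen` and `sketch_isspace` are hypothetical stand-ins for pg_mblen and t_isspace, the encoding is assumed to be UTF-8, and the stand-in predicate deliberately treats byte 0xA0 as a space to imitate the reported platform behavior.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for pg_mblen(): byte length of the UTF-8
 * character starting at p.  Never steps past a terminating NUL, even
 * when the sequence is truncated or malformed.
 */
static int
sketch_mblen(const char *p)
{
    unsigned char c = (unsigned char) *p;
    int         len = 1;
    int         i;

    if ((c & 0xE0) == 0xC0)
        len = 2;
    else if ((c & 0xF0) == 0xE0)
        len = 3;
    else if ((c & 0xF8) == 0xF0)
        len = 4;

    for (i = 1; i < len; i++)
    {
        /* Stop in front of anything that is not a continuation byte. */
        if (((unsigned char) p[i] & 0xC0) != 0x80)
            return i;
    }
    return len;
}

/*
 * Hypothetical stand-in for t_isspace().  It treats byte 0xA0 as a
 * space on purpose, imitating a platform whose single-byte test fires
 * on the second byte of a 2-byte UTF-8 character.
 */
static int
sketch_isspace(const char *p)
{
    unsigned char c = (unsigned char) *p;

    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == 0xA0;
}

/*
 * Cut the string at the first whitespace character.  Because the
 * pointer advances by whole characters, the predicate is only ever
 * applied at a character boundary.
 */
static void
trim_at_space(char *pbuf)
{
    while (*pbuf && !sketch_isspace(pbuf))
        pbuf += sketch_mblen(pbuf);
    *pbuf = '\0';
}
```

With the bytes C3 A0 20 62 as input, this loop tests C3, skips two bytes, and stops at the real space, so the 2-byte character survives intact. A byte-wise `pbuf++` would have tested A0 and cut the character in half.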

I looked around for other occurrences of the same problem and found
a couple. I also found occurrences of the same pattern for skipping
whitespace:

    while (*s && t_isspace(s))
        s++;

This is safe if and only if t_isspace is never true for multibyte
characters ... can anyone think of a counterexample?
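
One way to sidestep the question is to advance by character length in the skipping loop as well. The sketch below is standalone and makes assumptions purely for illustration: `char_len_utf8` and `is_space_at` are hypothetical stand-ins for pg_mblen and t_isspace, the encoding is UTF-8, and the predicate is made to accept U+3000 (bytes E3 80 80) as whitespace. Whether the real t_isspace ever does so on a given platform is not established here.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for pg_mblen(): byte length of the UTF-8
 * character at p, never stepping past a terminating NUL.
 */
static int
char_len_utf8(const char *p)
{
    unsigned char c = (unsigned char) *p;
    int         len = 1;
    int         i;

    if ((c & 0xE0) == 0xC0)
        len = 2;
    else if ((c & 0xF0) == 0xE0)
        len = 3;
    else if ((c & 0xF8) == 0xF0)
        len = 4;

    for (i = 1; i < len; i++)
    {
        if (((unsigned char) p[i] & 0xC0) != 0x80)
            return i;
    }
    return len;
}

/*
 * Hypothetical stand-in for t_isspace().  For illustration only, it
 * accepts ASCII whitespace plus the 3-byte sequence E3 80 80 (U+3000).
 */
static int
is_space_at(const char *p)
{
    const unsigned char *u = (const unsigned char *) p;

    if (u[0] == ' ' || u[0] == '\t' || u[0] == '\n' || u[0] == '\r')
        return 1;
    /* Short-circuit evaluation keeps these reads inside the string. */
    return u[0] == 0xE3 && u[1] == 0x80 && u[2] == 0x80;
}

/* Skip leading whitespace, one whole character at a time. */
static const char *
skip_space(const char *s)
{
    while (*s && is_space_at(s))
        s += char_len_utf8(s);
    return s;
}
```

Under that assumed predicate, a byte-wise `s++` would move from E3 to the first 80 byte, fail the test there, and leave `s` pointing into the middle of a character. Stepping by character length always lands on the next boundary.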

regards, tom lane
