Re: UTF8MatchText

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 17:33:08
Message-ID: 3999.1179423188@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
> character sequence. Is that putting it too mildly? Do we need stronger
> conditions than that? If it's correct, are there other MBCS for which
> this is true?

I don't think this is a correct analysis. If it were correct then we
could use the optimization for all backend charsets, because none of them
allows an MB character to contain a byte without the high bit set. But it
was stated somewhere upthread that that doesn't actually work. Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.
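
To illustrate the necessary-but-not-sufficient property, here is a minimal
Python sketch (not taken from the patch under discussion). It checks that the
bytes of a few sample multibyte characters all have the high bit set, so none
of them can be mistaken for '_', '%' or '\'. The helper names and the sample
characters are assumptions made for the illustration, and Python's "euc_jp"
and "euc_kr" codecs stand in for the corresponding backend encodings.

```python
# ASCII codes of the LIKE magic characters: '_', '%' and '\'.
MAGIC_BYTES = set(b"_%\\")

# Hypothetical sample data: a few multibyte characters per encoding.
SAMPLES = {
    "utf-8": ["\u3042", "\u30a4", "\ud55c"],   # hiragana A, katakana I, hangul HAN
    "euc_jp": ["\u3042", "\u30a4"],            # hiragana A, katakana I
    "euc_kr": ["\ud55c"],                      # hangul HAN
}


def all_high_bit(ch: str, encoding: str) -> bool:
    """True if every byte of the encoded character has the high bit set."""
    return all(b >= 0x80 for b in ch.encode(encoding))


def contains_magic_byte(ch: str, encoding: str) -> bool:
    """True if any byte of the encoded character equals a magic character."""
    return any(b in MAGIC_BYTES for b in ch.encode(encoding))


def check_samples() -> dict:
    """Map each encoding to True when all its samples are high-bit-only."""
    return {
        enc: all(all_high_bit(ch, enc) and not contains_magic_byte(ch, enc)
                 for ch in chars)
        for enc, chars in SAMPLES.items()
    }


if __name__ == "__main__":
    for enc, ok in check_samples().items():
        print(enc, "high-bit-only multibyte chars:", ok)
```

The samples pass this check in all three encodings, which is exactly why the
check alone cannot single out UTF8 as the encoding where byte-wise matching
is safe.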

I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.
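
A small Python sketch of such an out-of-sync match, offered as an
illustration rather than as the backend's actual matching code. It does a
naive byte-wise substring search (an assumed stand-in for byte-wise pattern
matching) on the two-character string katakana I + hiragana A, looking for
hiragana I, which is not in the string. The function names are hypothetical.

```python
def bytewise_contains(data: str, pattern: str, encoding: str) -> bool:
    """Naive byte-wise search that ignores character boundaries."""
    return pattern.encode(encoding) in data.encode(encoding)


def is_utf8_continuation(byte: int) -> bool:
    """UTF-8 non-first bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


DATA = "\u30a4\u3042"   # katakana I, hiragana A
PATTERN = "\u3044"      # hiragana I, which does not occur in DATA

# Character-wise comparison: the pattern character is absent.
charwise = PATTERN in DATA

# EUC_JP: DATA is A5 A4 | A4 A2 and PATTERN is A4 A4, so the byte-wise
# search matches the last byte of one character plus the first byte of
# the next.
euc_jp_bytewise = bytewise_contains(DATA, PATTERN, "euc_jp")

# UTF-8: DATA is E3 82 A4 | E3 81 82 and PATTERN is E3 81 84. A pattern
# starts with a first-byte value, which can never equal a continuation
# byte, so a match cannot begin in the middle of a character.
utf8_bytewise = bytewise_contains(DATA, PATTERN, "utf-8")

if __name__ == "__main__":
    print("char-wise match:       ", charwise)
    print("EUC_JP byte-wise match:", euc_jp_bytewise)
    print("UTF-8 byte-wise match: ", utf8_bytewise)
```

Only the EUC_JP byte-wise search reports a match, and that match is spurious:
it straddles the boundary between the two data characters.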

regards, tom lane
