Re: UTF8MatchText

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: UTF8MatchText
Date:	2007-05-17 21:18:35
Message-ID:	13130.1179436715@sss.pgh.pa.us
Lists:	pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> From my WIP patch, here's where the difference appears to be - note
> that UTF8 branch has two NextByte calls at the bottom, unlike the other
> branch:

Oh, I see: NextChar is still "real" but the patch is willing to have t
and p pointing into the middle of an MB character. That's a bit
risky. I think it works but it's making at least the following
undocumented assumptions:

* At a pattern backslash, it applies CHAREQ() but then advances
byte-by-byte over the matched characters (implicitly assuming that none
of these bytes will look like the magic characters). While that works
for backend-safe encodings, it seems a bit strange; you've already paid
the price of determining the character length once, not to mention
matching the bytes of the characters once, and then throw that knowledge
away. I think NextChar would make more sense in the backslash path.

* At pattern % or _, it's critical that we do NextChar not NextByte
on the data side. Else t is pointing into the middle of an MB sequence
when p isn't, and we have various out-of-sync conditions to worry about,
notably possibly calling NextChar when t is not pointing at the first
byte of the character, which will result in a wrong answer about the
character length.

* We *must* do NextChar not NextByte for _ since we have to match it to
exactly one logical character, not one byte. You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.
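
To make the out-of-sync hazard in the last two points concrete, here is a
minimal sketch in C. It is not code from the patch: utf8_len_from_lead and
next_char_offset are hypothetical names, and the length rule is simply the
standard UTF-8 lead-byte rule. It shows that asking for a character length
while pointing at a non-first byte gives an answer that leaves the pointer
inside the character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: UTF-8 character length as implied by a lead byte.
 * A continuation byte (10xxxxxx) carries no length information, so the
 * fallback of 1 is arbitrary; that is the hazard being illustrated.
 */
static size_t
utf8_len_from_lead(unsigned char b)
{
    if (b < 0x80)
        return 1;
    if ((b & 0xE0) == 0xC0)
        return 2;
    if ((b & 0xF0) == 0xE0)
        return 3;
    if ((b & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* continuation or invalid byte */
}

/*
 * Advance over one "character" starting at offset pos and return the new
 * offset, clamped to len.  The result is a character boundary only if pos
 * itself was one.
 */
static size_t
next_char_offset(const char *s, size_t len, size_t pos)
{
    size_t      n;

    assert(pos < len);
    n = utf8_len_from_lead((unsigned char) s[pos]);
    if (n > len - pos)
        n = len - pos;
    return pos + n;
}
```

For the four-byte string "\xE2\x82\xACx" (a three-byte character followed
by "x"), starting at offset 0 lands on offset 3, the boundary before "x".
Starting at offset 1, in the middle of the character, lands on offset 2,
which is still inside it, so a _ matched from there would consume a
fragment rather than one logical character.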

So the actual optimization here is that we do bytewise comparison and
advancing, but only when we are either at the start of a character
(on both sides, and the pattern char is not wildcard) or we are in the
middle of a character (on both sides) and we've already proven that both
sides matched for the previous byte(s) of the character.
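
The rule in the paragraph above can be sketched as a small matcher. This
is an illustration under stated assumptions, not the patch itself: the
names mb_len, like_match and like are invented here, the input is assumed
to be valid UTF-8, the escape character is fixed as backslash, and a
trailing backslash is simply treated as a non-match. Literal bytes are
compared and advanced one byte at a time; only % and _ advance the data
side by whole characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: UTF-8 length from the lead byte, clamped to len. */
static size_t
mb_len(const char *s, size_t len)
{
    unsigned char b = (unsigned char) *s;
    size_t      n = 1;

    if ((b & 0xE0) == 0xC0)
        n = 2;
    else if ((b & 0xF0) == 0xE0)
        n = 3;
    else if ((b & 0xF8) == 0xF0)
        n = 4;
    return (n > len) ? len : n;
}

static bool
like_match(const char *t, size_t tlen, const char *p, size_t plen)
{
    size_t      n;

    while (plen > 0)
    {
        if (*p == '\\')
        {
            /* Escape: the following pattern byte is an ordinary literal. */
            p++, plen--;
            if (plen == 0)
                return false;   /* simplification for this sketch */
        }
        else if (*p == '%')
        {
            p++, plen--;
            /* Try the rest of the pattern at every character boundary. */
            for (;;)
            {
                if (like_match(t, tlen, p, plen))
                    return true;
                if (tlen == 0)
                    return false;
                n = mb_len(t, tlen);    /* whole character, never a byte */
                t += n, tlen -= n;
            }
        }
        else if (*p == '_')
        {
            if (tlen == 0)
                return false;
            n = mb_len(t, tlen);        /* exactly one logical character */
            t += n, tlen -= n;
            p++, plen--;
            continue;
        }

        /* Literal path: bytewise compare, bytewise advance on both sides. */
        if (tlen == 0 || *t != *p)
            return false;
        t++, tlen--;
        p++, plen--;
    }
    return tlen == 0;
}

/* Convenience wrapper for NUL-terminated strings. */
static bool
like(const char *t, const char *p)
{
    return like_match(t, strlen(t), p, strlen(p));
}
```

With this sketch, like("h\xC3\xA9llo", "h_llo") matches because _ consumes
the two-byte character as a unit, while like("h\xC3\xA9llo", "h__llo") does
not; a matcher that advanced _ by bytes would get both of those wrong.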

On the strength of this closer reading, I would say that the patch isn't
relying on UTF8's first-byte-vs-not-first-byte property after all.
All that it's relying on is that no MB character is a prefix of another
one, which seems like a necessary property for any sane encoding; plus
that characters are considered equal only if they're bytewise equal.
So are we sure it doesn't work for non-UTF8 encodings? Maybe that
earlier conclusion was based on a misunderstanding of what the patch
really does.

regards, tom lane

In response to

Re: UTF8MatchText at 2007-05-17 19:57:25 from Andrew Dunstan

Responses

Re: UTF8MatchText at 2007-05-17 22:24:59 from Andrew Dunstan
Re: UTF8MatchText at 2007-05-18 02:07:20 from ITAGAKI Takahiro
Re: UTF8MatchText at 2007-05-20 07:44:54 from Dennis Bjorklund