Re: UTF8MatchText

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 19:57:25
Message-ID: 464CB3A5.9020600@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> Except that the entire point of this patch is to dumb down NextChar to
>>> be the same as NextByte for UTF8 strings.
>>>
>
>
>> That's not what I see in (what I think is) the latest submission, which
>> includes this snippet:
>>
>
> [ scratches head... ] OK, then I think I totally missed what this patch
> is trying to accomplish; because this code looks just the same as the
> existing multibyte-character operations. Where does the performance
> improvement come from?
>
>
>

That's what bothered me. The trouble is that we have so much code that
looks *almost* identical.

From my WIP patch, here's where the difference appears to be. Note that
the UTF8 branch ends with two NextByte calls, whereas the other branch
ends with two NextChar calls:

#ifdef UTF8_OPT
        /*
         * UTF8 is optimised to do byte at a time matching in most cases,
         * thus saving expensive calls to NextChar.
         *
         * UTF8 has disjoint representations for first-bytes and
         * not-first-bytes of MB characters, and thus it is
         * impossible to make a false match in which an MB pattern
         * character is matched to the end of one data character
         * plus the start of another.
         * In character sets without that property, we have to use the
         * slow way to ensure we don't make out-of-sync matches.
         */
        else if (*p == '_')
        {
            NextChar(t, tlen);
            NextByte(p, plen);
            continue;
        }
        else if (!BYTEEQ(t, p))
        {
            /*
             * Not the single-character wildcard and no explicit match? Then
             * time to quit...
             */
            return LIKE_FALSE;
        }

        NextByte(t, tlen);
        NextByte(p, plen);
#else
        /*
         * Branch for non-utf8 multi-byte charsets and also for single-byte
         * charsets which don't gain any benefit from the above
         * optimisation.
         */

        else if ((*p != '_') && !CHAREQ(t, p))
        {
            /*
             * Not the single-character wildcard and no explicit match? Then
             * time to quit...
             */
            return LIKE_FALSE;
        }

        NextChar(t, tlen);
        NextChar(p, plen);

#endif /* UTF8_OPT */
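
To make the idea concrete outside the patch, here is a minimal standalone
sketch of the same technique. It is not the patch code: is_continuation,
next_char and sketch_match are hypothetical names made up for illustration,
and the matcher handles only literal bytes and '_' (no '%', no escapes).
It assumes both strings are valid UTF8. Literals are compared a byte at a
time; only '_' has to step over a whole character in the data.

```c
#include <stddef.h>

/* A UTF8 continuation (not-first) byte always looks like 10xxxxxx. */
static int
is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Step over one whole character: first byte plus any continuation bytes. */
static void
next_char(const char **s, size_t *len)
{
    do
    {
        (*s)++;
        (*len)--;
    } while (*len > 0 && is_continuation((unsigned char) **s));
}

/*
 * Hypothetical simplified matcher: literal bytes and '_' only.
 * Returns 1 if text t matches pattern p, else 0.
 */
static int
sketch_match(const char *t, size_t tlen, const char *p, size_t plen)
{
    while (tlen > 0 && plen > 0)
    {
        if (*p == '_')
        {
            next_char(&t, &tlen);   /* '_' consumes a whole data character */
            p++;                    /* '_' itself is a single byte */
            plen--;
            continue;
        }
        if (*t != *p)
            return 0;               /* plain byte comparison is enough */

        t++;
        tlen--;
        p++;
        plen--;
    }
    return tlen == 0 && plen == 0;
}
```

Because first bytes and continuation bytes never share a value, the plain
byte comparison cannot line up a pattern character against the tail of one
data character and the head of the next, which is why only the '_' case
needs the character-aware step.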

cheers

andrew
