Re: UTF8MatchText

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-18 03:06:05
Message-ID: 464D181D.9010307@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
>
>> Yes, I only used the 'disjoint representations for first-bytes and
>> not-first-bytes of MB characters' feature in UTF8. Other encodings
>> allows both [AB] and [BA] for MB character patterns. UTF8Match() does
>> not cope with those encodings; If we have '[AB][AB]' in a table and
>> search it with LIKE '%[BA]%', we judge that they are matched by mistake.
>>
>
> AFAICS, the patch does *not* make that mistake because % will not
> advance over a fractional character.
>
>

Yeah, I think that's right.

Attached is my current WIP patch. If we decide that this optimisation
can in fact be applied to all backend encodings, that will be easily
incorporated. It will simplify the code further. Note that all the
common code in the MatchText and do_like_escape functions has been
factored - and the bytea functions just call the single-byte text
versions - AFAICS the effect will be identical to having the specialised
versions. (I'm always happy when code volume can be reduced.)

cheers

andrew

Attachment Content-Type Size
utf8.patch text/x-patch 24.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2007-05-18 03:31:22 Re: UTF8MatchText
Previous Message Greg Smith 2007-05-18 03:02:31 Re: Not ready for 8.3

Browse pgsql-patches by date

  From Date Subject
Next Message Tom Lane 2007-05-18 03:31:22 Re: UTF8MatchText
Previous Message Tom Lane 2007-05-18 02:35:06 Re: UTF8MatchText