
Re: UTF8MatchText

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 22:24:59
Lists: pgsql-hackers, pgsql-patches

Tom Lane wrote:
> * At a pattern backslash, it applies CHAREQ() but then advances
> byte-by-byte over the matched characters (implicitly assuming that none
> of these bytes will look like the magic characters).  While that works
> for backend-safe encodings, it seems a bit strange; you've already paid
> the price of determining the character length once, not to mention
> matching the bytes of the characters once, and then throw that knowledge
> away.  I think BYTEEQ would make more sense in the backslash path.

Probably, although the use of CHAREQ there is already in the present code.

Is it legal to follow the escape character with anything other than _, %,
or the escape character itself?

> So the actual optimization here is that we do bytewise comparison and
> advancing, but only when we are either at the start of a character
> (on both sides, and the pattern char is not wildcard) or we are in the
> middle of a character (on both sides) and we've already proven that both
> sides matched for the previous byte(s) of the character.

I think that's correct.
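
To make the point above concrete, here is a minimal sketch of a matcher that compares and advances bytewise for literal pattern bytes (including the escaped path) and only computes a character length at a wildcard. It is not the patch itself: utf8_len, like_match and like_cstr are hypothetical names, the input is assumed to be valid UTF-8, backslash is assumed to be the escape character, and treating any escaped character as a literal (and a dangling escape as no match) is an assumption of this sketch, not an answer to the question above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 character starting at *s (assumes valid UTF-8). */
static size_t utf8_len(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Returns 1 if t[0..tlen) matches the LIKE pattern p[0..plen), else 0. */
static int like_match(const unsigned char *t, size_t tlen,
                      const unsigned char *p, size_t plen)
{
    assert(t != NULL && p != NULL);
    while (plen > 0)
    {
        if (*p == '%')
        {
            p++; plen--;
            if (plen == 0)
                return 1;               /* trailing % matches the rest */
            while (tlen > 0)
            {
                /* try the rest of the pattern at each character boundary */
                if (like_match(t, tlen, p, plen))
                    return 1;
                size_t n = utf8_len(t);
                if (n > tlen) n = tlen;
                t += n; tlen -= n;
            }
            return like_match(t, 0, p, plen);
        }
        if (tlen == 0)
            return 0;
        if (*p == '_')
        {
            /* wildcard: the only place we need the character length */
            size_t n = utf8_len(t);
            if (n > tlen) n = tlen;
            t += n; tlen -= n;
            p++; plen--;
            continue;
        }
        if (*p == '\\')
        {
            p++; plen--;                /* next byte is taken literally */
            if (plen == 0)
                return 0;               /* dangling escape: assumption */
        }
        /* literal path: plain byte equality, advance one byte on each side;
         * continuation bytes are >= 0x80 and never look like % _ or \ */
        if (*p != *t)
            return 0;
        p++; plen--; t++; tlen--;
    }
    return tlen == 0;
}

/* Convenience wrapper for NUL-terminated strings. */
static int like_cstr(const char *t, const char *p)
{
    return like_match((const unsigned char *) t, strlen(t),
                      (const unsigned char *) p, strlen(p));
}
```

For example, like_cstr("h\xc3\xa9llo", "h_llo") matches because _ skips the whole two-byte character, while like_cstr("\xc3\xa9", "__") does not, since the text holds only one character.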

> On the strength of this closer reading, I would say that the patch isn't
> relying on UTF8's first-byte-vs-not-first-byte property after all.
> All that it's relying on is that no MB character is a prefix of another
> one, which seems like a necessary property for any sane encoding; plus
> that characters are considered equal only if they're bytewise equal.
> So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
> earlier conclusion was based on a misunderstanding of what the patch
> really does.


One more thing - I'm thinking of rolling up the bytea matching routines
together with the text routines to eliminate all the duplication of logic.
I can do it with a little type casting from bytea* to text* and back
again or, if that's not acceptable, with some preprocessor magic. I think
the casting is likely to be safe enough in this case - I don't think a
null byte will hurt us anywhere in this code - and presumably the
varlena stuff is all the same. Does that sound reasonable?
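
Roughly what I have in mind, as a standalone sketch rather than backend code: one matcher that works on explicit (pointer, length) pairs, with the per-type difference reduced to how far a wildcard advances. Everything here is hypothetical, including the blob struct (a stand-in for a length-prefixed value, not the real varlena layout) and the names my_text, my_bytea, match_bytes, text_like and bytea_like. Escape handling is left out to keep it short.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a length-prefixed value: header plus payload. */
typedef struct { size_t len; unsigned char data[]; } blob;
typedef blob my_text;    /* same layout, payload is UTF-8 characters */
typedef blob my_bytea;   /* same layout, payload is raw bytes */

typedef size_t (*charlen_fn)(const unsigned char *s);

/* bytea: every "character" is exactly one byte */
static size_t one_byte(const unsigned char *s) { (void) s; return 1; }

/* text: length from the first byte (assumes valid UTF-8) */
static size_t utf8_charlen(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Shared matcher. Lengths are explicit, so a NUL byte is ordinary data. */
static int match_bytes(const unsigned char *t, size_t tlen,
                       const unsigned char *p, size_t plen,
                       charlen_fn charlen)
{
    while (plen > 0)
    {
        if (*p == '%')
        {
            p++; plen--;
            for (;;)
            {
                if (match_bytes(t, tlen, p, plen, charlen)) return 1;
                if (tlen == 0) return 0;
                size_t n = charlen(t);
                if (n > tlen) n = tlen;
                t += n; tlen -= n;
            }
        }
        if (tlen == 0) return 0;
        if (*p == '_')
        {
            size_t n = charlen(t);
            if (n > tlen) n = tlen;
            t += n; tlen -= n; p++; plen--;
            continue;
        }
        if (*p != *t) return 0;         /* literal: plain byte equality */
        p++; plen--; t++; tlen--;
    }
    return tlen == 0;
}

static int text_like(const my_text *t, const my_text *p)
{
    return match_bytes(t->data, t->len, p->data, p->len, utf8_charlen);
}

static int bytea_like(const my_bytea *t, const my_bytea *p)
{
    return match_bytes(t->data, t->len, p->data, p->len, one_byte);
}

/* Build a blob from raw bytes; returns NULL if allocation fails. */
static blob *make_blob(const void *src, size_t len)
{
    blob *b = malloc(sizeof(blob) + len);
    if (b == NULL) return NULL;
    b->len = len;
    memcpy(b->data, src, len);
    return b;
}
```

With this shape, a bytea value holding 'a', NUL, 'b' matches the pattern a_b, and a single two-byte UTF-8 character matches _ as text but not as bytea. A function pointer is used here only for readability; the same split could be done with macros if the indirect call turns out to cost too much.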


