Re: UTF8MatchText

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: UTF8MatchText
Date: 2007-05-17 22:24:59
Message-ID: 464CD63B.7000609@dunslane.net
Lists: pgsql-hackers, pgsql-patches

Tom Lane wrote:
>
> * At a pattern backslash, it applies CHAREQ() but then advances
> byte-by-byte over the matched characters (implicitly assuming that none
> of these bytes will look like the magic characters).  While that works
> for backend-safe encodings, it seems a bit strange; you've already paid
> the price of determining the character length once, not to mention
> matching the bytes of the characters once, and then throw that knowledge
> away.  I think BYTEEQ would make more sense in the backslash path.
>   

Probably, although the use of CHAREQ is in the present code.

Is it legal to follow escape by anything other than _ % or escape?

>
> So the actual optimization here is that we do bytewise comparison and
> advancing, but only when we are either at the start of a character
> (on both sides, and the pattern char is not wildcard) or we are in the
> middle of a character (on both sides) and we've already proven that both
> sides matched for the previous byte(s) of the character.
>   

I think that's correct.
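
For illustration only, here is a minimal sketch of the behaviour described in the quoted paragraph. It is not the code from the patch: the names `utf8_char_len`, `like_match` and `like_cstr` are hypothetical, the input is assumed to be valid UTF-8, and the escape character is hard-wired to backslash. Literal bytes are compared and advanced one byte at a time; character lengths are only computed for `_` and `%`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 character starting at s (assumes valid input). */
static size_t utf8_char_len(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if ((*s & 0xE0) == 0xC0) return 2;
    if ((*s & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Simplified LIKE: '%' = any sequence, '_' = one character, '\' = escape. */
static bool like_match(const unsigned char *t, size_t tlen,
                       const unsigned char *p, size_t plen)
{
    while (tlen > 0 && plen > 0)
    {
        if (*p == '\\')
        {
            /* Escape: the next pattern byte is compared as a plain byte. */
            p++; plen--;
            if (plen == 0) return false;    /* trailing escape: rejected here */
            if (*p != *t) return false;
        }
        else if (*p == '%')
        {
            /* Collapse runs of '%'; a trailing '%' matches everything. */
            while (plen > 0 && *p == '%') { p++; plen--; }
            if (plen == 0) return true;
            /* Retry the rest of the pattern at each character boundary. */
            while (tlen > 0)
            {
                size_t n;
                if (like_match(t, tlen, p, plen)) return true;
                n = utf8_char_len(t);
                if (n > tlen) n = tlen;
                t += n; tlen -= n;
            }
            return false;
        }
        else if (*p == '_')
        {
            /* Wildcard: skip one whole character of the text. */
            size_t n = utf8_char_len(t);
            if (n > tlen) n = tlen;
            t += n; tlen -= n;
            p++; plen--;
            continue;
        }
        else if (*p != *t)
            return false;

        /* Bytes matched: advance a single byte on both sides. */
        t++; tlen--;
        p++; plen--;
    }
    if (tlen > 0) return false;             /* pattern ran out first */
    while (plen > 0 && *p == '%') { p++; plen--; }
    return plen == 0;
}

/* Convenience wrapper for NUL-terminated strings. */
static bool like_cstr(const char *text, const char *pattern)
{
    return like_match((const unsigned char *) text, strlen(text),
                      (const unsigned char *) pattern, strlen(pattern));
}
```

With this sketch, `like_cstr("h\xc3\xa9llo", "h_llo")` matches because `_` consumes the two-byte character as a unit, while the remaining literal bytes are compared one at a time. The bytewise path is only safe because no continuation byte of a multibyte character can equal `%`, `_` or backslash.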

> On the strength of this closer reading, I would say that the patch isn't
> relying on UTF8's first-byte-vs-not-first-byte property after all.
> All that it's relying on is that no MB character is a prefix of another
> one, which seems like a necessary property for any sane encoding; plus
> that characters are considered equal only if they're bytewise equal.
> So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
> earlier conclusion was based on a misunderstanding of what the patch
> really does.
>


Indeed.

One more thing - I'm thinking of rolling up the bytea matching routines 
as well as the text routines to eliminate all the duplication of logic. 
I can do it by a little type casting from bytea* to text* and back 
again, or if that's not acceptable by some preprocessor magic. I think 
the casting is likely to be safe enough in this case - I don't think a 
null byte will hurt us anywhere in this code - and presumably the 
varlena stuff is all the same. Does that sound reasonable?
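
To make the idea concrete, here is a rough sketch of one length-counted routine serving both kinds of input. It is an assumption-laden illustration: `counted_bytes` is a hypothetical stand-in and not PostgreSQL's real varlena layout, and the wrapper names are invented. Because the shared routine relies only on the stored lengths, an embedded zero byte is treated like any other byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-counted value; NOT the real varlena struct. */
typedef struct
{
    size_t               len;
    const unsigned char *data;
} counted_bytes;

/* Shared routine: does v start with prefix?  Only lengths are consulted,
 * so zero bytes inside the data are ordinary content. */
static bool counted_starts_with(const counted_bytes *v,
                                const counted_bytes *prefix)
{
    if (prefix->len > v->len)
        return false;
    return memcmp(v->data, prefix->data, prefix->len) == 0;
}

/* Text-flavoured wrapper: lengths come from strlen(). */
static bool text_starts_with(const char *s, const char *prefix)
{
    counted_bytes v = { strlen(s), (const unsigned char *) s };
    counted_bytes p = { strlen(prefix), (const unsigned char *) prefix };
    return counted_starts_with(&v, &p);
}

/* Binary-flavoured wrapper: lengths are supplied by the caller. */
static bool bytea_starts_with(const unsigned char *b, size_t blen,
                              const unsigned char *pre, size_t prelen)
{
    counted_bytes v = { blen, b };
    counted_bytes p = { prelen, pre };
    return counted_starts_with(&v, &p);
}
```

Both wrappers funnel into the same comparison, which is the kind of duplication the proposed roll-up would remove; whether that is done by casting or by preprocessor tricks is a separate choice.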


cheers

andrew
