
Re: like/ilike improvements

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: like/ilike improvements
Date: 2007-05-25 03:20:51
Message-ID: 29948.1180063251@sss.pgh.pa.us
Lists: pgsql-hackers, pgsql-patches
I wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Yes, I agree completely. However it looks to me like IsFirstByte will in 
>> fact always be true when we get to call NextChar for matching "_" for UTF8.

> If that's true, the patch is failing to achieve its goal of treating %
> bytewise ...

OK, I studied it a bit more and now see what you're driving at: in this
form of the patch, we treat % bytewise unless it is followed by _, in
which case we treat it char-wise.  That seems a good tradeoff,
considering that such a pattern is probably pretty uncommon --- we
should be willing to handle it a bit slower to simplify other cases.

The patch seems still not right though, because you are advancing by
bytes when \ follows %, and that isn't correct in a non-UTF8 encoding.
The invariant we are actually insisting on here is that at the time of
entry to MatchText(), whether initial or recursive, t and p must be
correctly char-aligned.  I suggest the attached revision of the logic as
a way to clarify that, and maybe save a cycle or two in the inner loop
as well.

Yes, I concur we needn't bother with IsFirstByte except maybe as an
Assert.  If it is an Assert it should be up at the top of the function.
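
To make the alignment invariant concrete, here is a minimal sketch of what such an entry check might look like for UTF-8 only. The names utf8_is_first_byte and assert_char_aligned are hypothetical and are not the patch's IsFirstByte macro; the sketch relies only on the fact that a UTF-8 continuation byte always has the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the patch's IsFirstByte macro): in UTF-8 a
 * continuation byte always has the bit pattern 10xxxxxx, so any other
 * byte starts a character.
 */
static bool
utf8_is_first_byte(unsigned char c)
{
	return (c & 0xC0) != 0x80;
}

/*
 * Sketch of an entry check: whenever there is anything left to look at,
 * both the text and the pattern pointer must sit on a character boundary.
 */
static void
assert_char_aligned(const unsigned char *t, size_t tlen,
					const unsigned char *p, size_t plen)
{
	assert(tlen == 0 || utf8_is_first_byte(*t));
	assert(plen == 0 || utf8_is_first_byte(*p));
}
```

Because continuation bytes can never equal a first byte, a bytewise scan of UTF-8 text for the pattern's first byte can only hit on character boundaries, so a check like this would hold at every recursive call made from such a scan.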

			regards, tom lane

		else if (*p == '%')
		{
			/* %% is the same as % according to the SQL standard */
			/* Advance past all %'s */
			do {
				NextByte(p, plen);
			} while (plen > 0 && *p == '%');
			/* Trailing percent matches everything. */
			if (plen <= 0)
				return LIKE_TRUE;

			/*
			 * Otherwise, scan for a text position at which we can match the
			 * rest of the pattern.
			 */
			if (*p == '_')
			{
				/*
				 * If we have %_ in the pattern, we need to advance char-wise
				 * to avoid starting the recursive call on a non-char boundary.
				 * This could be made more efficient, but at the cost of making
				 * other paths slower; it seems not a common case, so handle
				 * it this way.
				 */
				while (tlen > 0)
				{
					int			matched = MatchText(t, tlen, p, plen);

					if (matched != LIKE_FALSE)
						return matched; /* TRUE or ABORT */

					NextChar(t, tlen);
				}
			}
			else
			{
				/*
				 * Optimization to prevent most recursion: don't recurse
				 * unless first pattern char matches the text char.
				 */
				char	firstpat;

				if (*p == '\\')
				{
					if (plen < 2)
						return LIKE_FALSE;
					firstpat = p[1];
				}
				else
					firstpat = *p;

				while (tlen > 0)
				{
					if (*t == firstpat)
					{
						int			matched = MatchText(t, tlen, p, plen);
						
						if (matched != LIKE_FALSE)
							return matched; /* TRUE or ABORT */
					}

					/*
					 * In UTF8 it's cheaper to advance bytewise and do
					 * useless comparisons of firstpat to non-first bytes
					 * than to invoke pg_mblen.  In other character sets
					 * we must advance by chars to avoid spurious matches.
					 */
#ifdef UTF8OPT
					NextByte(t, tlen);
#else
					NextChar(t, tlen);
#endif
				}
			}

			/*
			 * End of text with no match, so no point in trying later places
			 * to start matching this pattern.
			 */
			return LIKE_ABORT;
		}
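
The "spurious matches" mentioned in the comment above can be illustrated with a small standalone sketch. It assumes Shift-JIS as the example non-UTF8 encoding (the message does not name one) and uses a deliberately simplified length rule; sjis_char_len, hits_bytewise and hits_charwise are hypothetical names, not backend functions. In Shift-JIS the second byte of a two-byte character can equal an ASCII byte such as 0x5C, so a bytewise scan can find a candidate start position in the middle of a character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified Shift-JIS length rule, for illustration only: bytes
 * 0x81-0x9F and 0xE0-0xFC start a two-byte character; every other
 * byte is treated as a one-byte character.
 */
static size_t
sjis_char_len(unsigned char c)
{
	if ((c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC))
		return 2;
	return 1;
}

/* Count candidate start positions when advancing one byte at a time. */
static size_t
hits_bytewise(const unsigned char *t, size_t tlen, unsigned char firstpat)
{
	size_t		hits = 0;

	for (size_t i = 0; i < tlen; i++)
		if (t[i] == firstpat)
			hits++;
	return hits;
}

/* Count candidate start positions when advancing one character at a time. */
static size_t
hits_charwise(const unsigned char *t, size_t tlen, unsigned char firstpat)
{
	size_t		hits = 0;
	size_t		i = 0;

	while (i < tlen)
	{
		if (t[i] == firstpat)
			hits++;
		i += sjis_char_len(t[i]);	/* skip any trail byte */
	}
	return hits;
}
```

For the byte sequence 0x83 0x5C 'x' (a two-byte character followed by 'x'), scanning for 0x5C bytewise reports one candidate position, inside the two-byte character, while the char-wise scan reports none. That is the case the NextChar branch guards against.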
