Re: [v9.2] make_greater_string() does not return a string in some cases

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 01:49:27
Message-ID: 6425.1316656167@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> As I understand it, the issue here is that when we try to generate
> suitable > and < quals for a LIKE expression, we need to find a string
> which is greater than the prefix we're searching for, and the existing
> algorithm sometimes fails. But there seems to be some dispute over
> how likely this is to occur. Tom argues that the case is so rare that
> we shouldn't worry about it:
> http://archives.postgresql.org/pgsql-bugs/2010-06/msg00336.php

Well, I didn't like the amount of overhead that was implied by that
proposal. I don't have a great deal of difficulty with the idea of
encoding-specific character incrementers, especially since they could
presumably be more efficient than the current brute force approach
wherein we call pg_verifymbstr on the entire string after mangling
just the last byte.
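(The shape of that brute-force loop can be sketched as follows. The validator here is a stand-in, not pg_verifymbstr(); the point is only that the whole string is re-verified on every attempt, which an encoding-specific incrementer could avoid by validating just the final character.)

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for pg_verifymbstr(): here simply "all bytes are ASCII".
 * The real function validates multibyte sequences for the database
 * encoding; what matters is that it scans the ENTIRE string. */
static bool verify_whole_string(const char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if ((unsigned char) s[i] >= 0x80)
            return false;
    return true;
}

/* Brute-force approach: bump the last byte, then re-verify the whole
 * string, repeating until a valid string is found or we run out. */
static bool increment_last_byte(char *s, size_t len)
{
    while ((unsigned char) s[len - 1] < 0x7F)
    {
        s[len - 1]++;
        if (verify_whole_string(s, len))   /* O(len) work per attempt */
            return true;
    }
    return false;
}
```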

The main risk that I foresee with the proposed approach is that if you
have, say, a 4-byte final character, you could iterate through a *whole
lot* (millions) of larger encoded characters, with absolutely no hope of
making a greater string that way when the determining difference occurs
at some earlier character. And then when you do run out, you could
waste just as much time at the immediately preceding character, etc etc.
The existing algorithm is a compromise between thoroughness of
investigation of different values of the last character versus speed of
falling back to change the preceding character instead. I'd be the
first to say that incrementing only the last byte is a very
quick-and-dirty heuristic for making that happen. But I think it would
be unwise to allow the thing to consider more than a few hundred values
for a character position before giving up. Possibly the
encoding-specific incrementer could be designed to run through all legal
values of the last byte, then start skipping larger and larger ranges
--- maybe just move directly to incrementing the first byte.
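(The capped search with fallback described above might look roughly like this. This is a hypothetical sketch, not the submitted patch: the validity test is a printable-ASCII stand-in for an encoding-specific incrementer, and the cap of 256 stands in for "a few hundred values".)

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_TRIES_PER_POSITION 256  /* cap per character position */

/* Hypothetical per-character validity check; a real implementation
 * would use an encoding-specific incrementer instead. */
static bool char_ok(unsigned char c)
{
    return c >= 0x20 && c < 0x7F;
}

/* Try successor values for the last position, bounded by the cap; on
 * failure, truncate and fall back to the preceding position rather
 * than iterating through millions of larger encoded characters. */
static bool make_greater(char *s, size_t *len)
{
    while (*len > 0)
    {
        unsigned char c = (unsigned char) s[*len - 1];
        int tries = 0;

        while (tries++ < MAX_TRIES_PER_POSITION && c < 0xFF)
        {
            c++;
            if (char_ok(c))
            {
                s[*len - 1] = (char) c;
                return true;
            }
        }
        (*len)--;               /* give up on this position; back up */
        s[*len] = '\0';
    }
    return false;               /* no greater string exists */
}
```

With a cap in place, a position whose remaining successor values are all invalid costs a bounded amount of work before the algorithm retreats to the preceding character.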

Aside from that issue, the submitted patch seems to need quite a lot of
detail work; for instance, I noted an unconstrained memcpy into a 4-byte
local buffer, as well as lots and lots of violations of PG house style.
That's certainly all fixable but somebody will have to go through it.

> One random idea I have is - instead of generating > and < clauses,
> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
> 1, length(b)) = b? We'd need to do something about the selectivity,
> but I don't see why that would be a problem.

The problem is that you'd need to make that a btree-indexable operator.

regards, tom lane
