Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Oct 29, 2011 at 4:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Oh! You are right, I was expecting it to try multiple characters at the
>> same position before truncating the string. This change seems to have
>> lobotomized things rather thoroughly. What is the rationale for that?
>> As an example, when dealing with a single-character string, it will fail
>> altogether if the next code value sorts out-of-order, so this seems to
>> me to be a rather large step backwards.
> On this point I believe you are still confused. The old code tried
> one character per position, and the new code tries one character per
> position. Nothing has been lobotomized in any way.
No, on this point you are flat out wrong. Try something like
select ... where f1 like 'p%';
in tt_RU locale, wherein 'q' sorts between 'k' and 'l'. The old code
correctly found that 'r' works as a string greater than 'p'. The new
code fails to find a greater string, because it only tries 'q' and then
gives up. This results in a selectivity estimate much poorer than
it needs to be.
Since the stated purpose of this patch is to fix some corner cases
where the code fails to find a greater string, I fail to see why it's
acceptable to introduce some other corner cases that weren't there
before.
> The difference is
> that the old code used a "guess and check" approach to generate the
> character, so there was an inner loop that was trying to generate a
> character (possibly generating various garbage strings that did not
> represent a character along the way) and then, upon success, checked
> the sort order of that single string before truncating and retrying.
You are misreading the old code. The inner loop increments the last
byte, checks for success, and if it hasn't produced a greater string
then it loops around to increment again.
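To make the distinction concrete, here is a minimal sketch of that old
inner loop. This is not the actual make_greater_string() code: it uses
plain strcmp() where the real code applies the locale-aware comparison
the index would use, and the function name is made up for illustration.
The point is that a single call keeps incrementing the last byte until
it either finds a greater string or the byte maxes out:

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative sketch only: repeatedly increment the last byte of buf
 * until the result compares greater than orig, giving up only when the
 * byte reaches its maximum value.  The real code compares with the
 * locale's collation, not strcmp(), which is exactly why trying many
 * successive byte values matters (e.g. 'q' may sort before 'p').
 */
static int
increment_last_byte(char *buf, const char *orig)
{
	size_t		len = strlen(buf);
	unsigned char *lastp;

	if (len == 0)
		return 0;
	lastp = (unsigned char *) &buf[len - 1];

	while (*lastp < 0xFF)
	{
		(*lastp)++;
		if (strcmp(buf, orig) > 0)
			return 1;			/* found a greater string */
	}
	return 0;					/* last byte maxed out without success */
}
```

Under a byte-wise comparison the first increment succeeds immediately;
the loop only earns its keep under a collation where successive code
values can sort out of order.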
> The fact that we haven't gotten any complaints before suggests that this
> actually works decently well as it stands.
Well, that's true of the old algorithm ;-)
I had likewise thought of the idea of trying some fixed number of
character values at each position, but it's unclear to me why that's
better than allowing an encoding-specific behavior. I don't believe
that we could get away with trying less than a few dozen values, though.
For example, in a situation where case sensitivity is relevant, you
might need to increment past all the upper-case letters to get to a
suitable lower-case letter. I also think that it's probably useful to
try incrementing higher-order bytes of a multibyte character before
giving up --- we just can't afford to do an exhaustive search.
Thus my proposal to let the low-order bytes max out but not cycle.
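A rough sketch of what "max out but not cycle" could look like, again
with invented names and a byte-wise memcmp() standing in for the real
locale-aware comparison and multibyte-character validation: when the
lowest-order byte reaches 0xFF without producing a greater string, it
stays pinned there while we move on to incrementing the next-higher
byte, rather than wrapping around and retrying every combination:

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative sketch only: increment bytes starting from the
 * low-order end.  When a byte maxes out at 0xFF it is left there (no
 * cycling back to 0), and we proceed to the next-higher-order byte.
 * This bounds the search at roughly 256 tries per byte position
 * instead of an exhaustive search over all byte combinations.
 */
static int
increment_no_cycle(unsigned char *buf, int nbytes,
				   const unsigned char *orig)
{
	for (int pos = nbytes - 1; pos >= 0; pos--)
	{
		while (buf[pos] < 0xFF)
		{
			buf[pos]++;
			if (memcmp(buf, orig, nbytes) > 0)
				return 1;		/* found a greater byte sequence */
		}
		/* this byte stays at 0xFF; fall through to the next one up */
	}
	return 0;
}
```

For example, starting from the two bytes {0x41, 0xFF}, the low byte
cannot be incremented, so the search moves directly to the high byte
and yields {0x42, 0xFF} without ever resetting the low byte.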
regards, tom lane