Re: [v9.2] make_greater_string() does not return a string in some cases

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-10-30 00:04:09
Message-ID: CA+TgmoZbXGOCYCBrb1T6w8sVDxOA4yP-B8CPtXv4FxxDwRWwtA@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

On Sat, Oct 29, 2011 at 4:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, it might not be strictly necessary for pg_utf8_increment() and
>> pg_eucjp_increment(), but it's clearly necessary for the generic
>> incrementer function for exactly the same reason it was needed in the
>> old coding.  I suppose we could weaken the rule to "you must leave a
>> valid character behind rather than a bunch of bytes that doesn't
>> encode to a character", but the cycle savings are negligible and the
>> current rule seems both simpler and more bullet-proof.
>
> No, it's *not* necessary any more, AFAICS.  make_greater_string discards
> the data and overwrites it with a null instantly upon getting a failure
> return from the incrementer.  The reason we used to need it was that we
> did pg_mbcliplen after failing to increment, but now we do that before
> we ever increment anything, so we already know the length of the last
> character.  It doesn't matter whether those bytes are still valid or
> contain garbage.

Oh, dude. I think you are right. I guess we can rip that crap out
then. That's much cleaner.
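
For anyone following along, here is a rough sketch of the control flow
in question - simplified, with made-up helper names (charincfunc,
string_is_greater, make_result), and not the actual varlena.c code.
The point is that the byte length of the last character is computed
with pg_mbcliplen() before the incrementer ever runs, so on failure we
can simply chop that character off regardless of what bytes it left
behind:

static text *
make_greater_string_sketch(char *workstr, int len)
{
    while (len > 0)
    {
        /* byte length of the last (possibly multibyte) character */
        int            charlen = len - pg_mbcliplen(workstr, len, len - 1);
        unsigned char *lastchar = (unsigned char *) workstr + (len - charlen);

        if (charincfunc(lastchar, charlen))
        {
            /* got a valid larger character: does the string sort higher? */
            if (string_is_greater(workstr, len))
                return make_result(workstr, len);
        }

        /*
         * Either the incrementer gave up or the candidate did not sort
         * higher.  Truncate the last character away and move one position
         * to the left; whatever bytes the incrementer left behind no
         * longer matter, because we already know charlen.
         */
        len -= charlen;
        workstr[len] = '\0';
    }

    return NULL;                /* no string greater than the input */
}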

>>> I'm also quite distressed that you ignored my advice to limit the number
>>> of combinations tried.  This patch could be horribly slow when dealing
>>> with wide characters, eg think what will happen when starting from
>>> U+10000.
>
>> Uh, I think it will try at most one character in that position and
>> then truncate away that character entirely, per my last email on this
>> topic (to which you never responded):
>> http://archives.postgresql.org/pgsql-hackers/2011-09/msg01195.php
>
> Oh!  You are right, I was expecting it to try multiple characters at the
> same position before truncating the string.  This change seems to have
> lobotomized things rather thoroughly.  What is the rationale for that?
> As an example, when dealing with a single-character string, it will fail
> altogether if the next code value sorts out-of-order, so this seems to
> me to be a rather large step backwards.
>
> I think we ought to go back to the previous design of incrementing till
> failure and then truncating, which puts the onus on the incrementer to
> make a reasonable tradeoff of how many combinations to try per character
> position.  There's a simple tweak we could make to the patch to limit
> that: once we've maxed out a lower-order byte of a multibyte char,
> *don't reset it to minimum*, just move on to incrementing the next
> higher byte.  This preserves the old property that the maximum number of
> combinations tried is bounded by 256 * string's length in bytes.

On this point I believe you are still confused. The old code tried
one character per position, and the new code tries one character per
position. Nothing has been lobotomized in any way. The difference is
that the old code used a "guess and check" approach to generate the
character, so there was an inner loop that was trying to generate a
character (possibly generating various garbage strings that did not
represent a character along the way) and then, upon success, checked
the sort order of that single string before truncating and retrying.
The new code does exactly the same thing in the outer loop - i.e. it
truncates one character per iteration - but the inner loop has, at
least for UTF-8 and EUC-JP, been replaced with an algorithm that is
guaranteed to produce a valid character without needing to loop.
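
To illustrate what "guaranteed to produce a valid character without
needing to loop" means, here is a heavily simplified sketch of a
carry-based UTF-8 incrementer. This is not the real
pg_utf8_increment(); it ignores corner cases the real function has to
worry about (the surrogate range, the U+10FFFF ceiling, and so on),
but the shape of the idea is the same:

#include <stdbool.h>

static bool
utf8_increment_sketch(unsigned char *charptr, int charlen)
{
    /* rough upper limit for the lead byte of each sequence length */
    static const unsigned char lead_max[5] = {0, 0x7F, 0xDF, 0xEF, 0xF4};
    int         i;

    /* bump a continuation byte if we can, carrying leftward on overflow */
    for (i = charlen - 1; i > 0; i--)
    {
        if (charptr[i] < 0xBF)
        {
            charptr[i]++;
            return true;
        }
        charptr[i] = 0x80;      /* wrapped; carry into the next byte */
    }

    /* all continuation bytes wrapped: try to bump the lead byte */
    if (charptr[0] < lead_max[charlen])
    {
        charptr[0]++;
        return true;
    }

    return false;               /* no larger character of this length */
}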

Now, having said that, I think there is a possibility for some
improvement here. If we know we're not going to spend a lot of time
uselessly screwing around trying to get something that will pass
pg_verifymbstr(), then we could probably afford to call ltproc several
times per position, rather than just once. But that's not restoring
the behavior of the old algorithm; that's improving on the old
algorithm. And before we do that, we need to think about a couple of
things: first, that silly looping behavior is still there for anything
other than UTF-8 and EUC-JP, and until we have constant-time increment
functions for every encoding we support, we probably don't want to get
too jiggy with it; second, we need to convince ourselves that this
will succeed in a meaningful number of cases where the current
algorithm fails.

I played around a bit with the UTF-8 case (with collation =
en_US.UTF-8) before committing this, and I suspect that trying 4 or 5
characters per position could be a win - you might, for example, be
looking at something like an accented E, and you might have several
different versions of E in a row before you get to something that's no
longer E-like. If you get to that point and still don't find something
that compares favorably, it's probably time to throw in the towel.
Another idea would be to do the first increment by one, and then
increment by two, four, and eight before giving up, or something like
that.

It probably needs some research, and frankly I'm happy to leave it to
someone who is having a real-world problem with it. The fact that we
haven't gotten any complaints before suggests that this actually works
decently well as it stands.
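
Just to make that increment-by-powers-of-two idea concrete, here is a
purely hypothetical sketch - nothing like this is committed, and
try_candidate() stands in for building the candidate string and
comparing it against the original via ltproc:

#include <stdbool.h>

#define MAX_TRIES_PER_POSITION 4    /* steps of 1, 2, 4, 8 */

static bool
try_position(unsigned char *lastchar, int charlen,
             bool (*charinc) (unsigned char *, int),
             bool (*try_candidate) (void))
{
    int     step = 1;
    int     attempt;

    for (attempt = 0; attempt < MAX_TRIES_PER_POSITION; attempt++)
    {
        int     i;

        /* advance the character by "step" single increments */
        for (i = 0; i < step; i++)
        {
            if (!charinc(lastchar, charlen))
                return false;   /* ran off the end of the encoding */
        }

        /* does the resulting string sort above the original? */
        if (try_candidate())
            return true;

        step *= 2;              /* next time, jump farther: 2, 4, 8 */
    }

    return false;               /* give up; caller truncates this position */
}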

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
