LIKE fixed(?) for non-ASCII collation orders

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: LIKE fixed(?) for non-ASCII collation orders
Date: 1999-12-31 05:54:06
Message-ID: 17785.946619646@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have just committed what I hope is the final solution for the problem
of LIKE index optimization in non-ASCII locales. indxpath.c now
generates both a lower and upper indexqual in all locales. For example,
x LIKE 'foo%t'
will create indexqual conditions
x >= 'foo' AND x < 'fop'
The "<" condition is omitted only if the code is unable to produce a
string greater than the pattern's constant prefix.

Locale-specific variations in collation order are handled by the
cut-and-try method I suggested a while ago:

> The approach I was considering for fixing the problem was to use a
> loop that would repeatedly try to generate a string greater than the
> prefix string. The basic loop step would increment the rightmost
> byte as Goran has done (or, if it's already up to the limit, chop
> it off and increment the next character position). Then test to
> see whether the '<' operator actually believes the result is
> greater than the given prefix, and repeat if not.

Although I believe that the code will work in non-ASCII locales and
MULTIBYTE character sets, I'm not set up to try it easily. Could
some folks try it out and report back?

The critical subroutine is attached, so if you prefer to eyeball
the code, here it is...

regards, tom lane

/*
* Try to generate a string greater than the given string or any string it is
* a prefix of. If successful, return a palloc'd string; else return NULL.
*
* To work correctly in non-ASCII locales with weird collation orders,
* we cannot simply increment "foo" to "fop" --- we have to check whether
* we actually produced a string greater than the given one. If not,
* increment the righthand byte again and repeat. If we max out the righthand
* byte, truncate off the last character and start incrementing the next.
* For example, if "z" were the last character in the sort order, then we
* could produce "foo" as a string greater than "fonz".
*
* This could be rather slow in the worst case, but in most cases we won't
* have to try more than one or two strings before succeeding.
*
* XXX in a sufficiently weird locale, this might produce incorrect results?
* For example, in German I believe "ss" is treated specially --- if we are
* given "foos" and return "foot", will this actually be greater than "fooss"?
*/
static char *
make_greater_string(const char * str, Oid datatype)
{
char *workstr;
int len;

/* Make a modifiable copy, which will be our return value if successful */
workstr = pstrdup((char *) str);

while ((len = strlen(workstr)) > 0)
{
unsigned char *lastchar = (unsigned char *) (workstr + len - 1);

/*
* Try to generate a larger string by incrementing the last byte.
*/
while (*lastchar < (unsigned char) 255)
{
(*lastchar)++;
if (string_lessthan(str, workstr, datatype))
return workstr; /* Success! */
}
/*
* Truncate off the last character, which might be more than 1 byte
* in MULTIBYTE case.
*/
#ifdef MULTIBYTE
len = pg_mbcliplen((const unsigned char *) workstr, len, len-1);
workstr[len] = '\0';
#else
*lastchar = '\0';
#endif
}

/* Failed... */
pfree(workstr);
return NULL;
}

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 1999-12-31 06:08:01 When are subqueries needed
Previous Message Bruce Momjian 1999-12-31 03:34:55 Re: [HACKERS] Is backend-libpq's "PQexec/PQfn/portal" code dead?