Yet another LIKE-indexing scheme

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Yet another LIKE-indexing scheme
Date: 2000-09-02 17:39:47
Message-ID: 1429.967916387@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I had another thought about fixing our problems with deriving index
bounds for LIKE patterns in non-ASCII locales. (If you don't remember
the gory details here, please re-read thread
Sigh, LIKE indexing is *still* broken in foreign locales
from pgsql-hackers archives of 7 to 10 June, 2000; there are also many
previous go-rounds about this long-standing issue.)

The problems that I've been told about seem to center around one- and
two-character patterns that have special sort rules in some locales.
Could we work around these problems by dropping one or perhaps two
characters from the end of the given LIKE prefix? For example, given
WHERE name LIKE 'foobar%'
drop the last fixed character ('r') and generate index bounds from what
remains, using the same algorithm as in 7.0. So the index bounds would
become
WHERE name >= 'fooba' AND name < 'foobb'
(at least in ASCII locale --- to make the upper bound, we'd search for
a string considered greater than 'fooba' by the local strcmp()).

The truncation would need to be multibyte-aware, of course.

This would, for example, fix the example given by Erich Stamberger:

> Another interresting feature of Czech collation is:
>
> H < "CH" < I
>
> and:
>
> B < C < C + CARON < D .. < H < "CH" < I
>
> So what happens with "WHERE name like 'Czec%`" ?

Our existing code fails because it generates WHERE name >= 'Czec' AND
name < 'Czed'; it will therefore not find names beginning 'Czech'
because those are in another part of the index, between 'Czeh' and
'Czei'. But WHERE name >= 'Cze' AND name < 'Czf' would work.

Are there examples where this still doesn't work? (Funny sort rules
for trigraphs would break it, I'm sure, unless we drop two characters
instead of just one.)

Obviously we could still keep the last character in ASCII locale.
That would be a good thing since it'd reduce the number of tuples
scanned. Is there a portable way to determine whether it's safe to
do so in other locales? (Some inquiry function about whether the sort
ordering has any digraph or two-to-one rules might help, but I don't
know if there is one.)

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephan Szabo 2000-09-02 17:51:03 Re: Really bad/weird stuff with views over tables in 7.0.2
Previous Message Tom Lane 2000-09-02 16:48:46 Re: PL/Perl compilation error