Sigh, LIKE indexing is *still* broken in foreign locales

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Moucha Václav <MouchaV(at)Radiomobil(dot)cz>
Cc: pgsql-bugs(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Sigh, LIKE indexing is *still* broken in foreign locales
Date: 2000-06-08 02:22:06
Message-ID: 14045.960430926@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

=?ISO-8859-2?Q?Moucha_V=E1clav?= <MouchaV(at)Radiomobil(dot)cz> writes:
> 1. Compilation
> ./configure --enable-locale # not needed for RPMS precompiled binaries

> 2. Starting postmaster
> export LC_CTYPE=cs_CZ
> export LC_COLLATE=cs_CZ # this setting is important for the
> bug result
> postmaster -S -D /home/pgsql/data -o '-Fe'

> 3. SQL steps
> create table test (name text);
> insert into test values (''); # the first char is E1 from LATIN 2
> coding
> insert into test values ('b');
> create index test_index on test (name);
> set cpu_tuple_cost=1; # force backend to use index
> scanning
> select * from test where name like '%';

> BUG: Only 1 line is selected with '' only instead of both lines.

The problem here is that given the search pattern '\341%', the planner
generates index limit conditions
name >= '\341' AND name < '\342';

Apparently, in CZ locale it is true that '\341' is less than '\342',
but it does not follow from that that all strings starting with '\341'
are less than '\342'. In fact '\341b' is considered greater than '\342'.

Since '\341' and '\342' are two different accented forms of 'a'
(if I'm looking at the right character set), this is perhaps not so
improbable as all that. Evidently the collation rule is that different
accent forms sort the same unless the strings would otherwise be
considered equal, in which case an ordering is assigned to them.

So, the rule we thought we had for generating index bounds falls flat,
and we're back to the same old question: given a proposed prefix string,
how can we generate bounds that are certain to be considered <= and >=
all strings starting with that prefix?

I am now thinking that maybe we should search for a string that compares
greater than "fooz" when the prefix is "foo" --- that is, append a 'z'
to the prefix string. But I wouldn't be surprised if that fails too
in some locales.

I'm also wondering if the left-hand inequality ('foo' <= any string
beginning with 'foo') might fail in some locales ... we haven't seen
it reported but who knows ...

regards, tom lane

In response to

  • LIKE bug at 2000-06-07 08:16:15 from Moucha Václav

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Giles Lean 2000-06-08 06:41:59 Re: [HACKERS] Sigh, LIKE indexing is *still* broken in foreign locales
Previous Message Tom Lane 2000-06-07 23:22:25 Re: Bug?

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2000-06-08 02:34:36 Re: Doc updates for index cost estimator change
Previous Message Michael Simms 2000-06-08 01:16:03 Re: Apparent deadlock 7.0.1