Re: Yet another LIKE-indexing scheme

From: Erich Stamberger <eberger(at)gewi(dot)kfunigraz(dot)ac(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Yet another LIKE-indexing scheme
Date: 2000-09-03 17:35:05
Message-ID: Pine.LNX.4.21.0009031546500.4051-100000@gewi.kfunigraz.ac.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

On Sat, 2 Sep 2000, Tom Lane wrote:

> This would, for example, fix the example given by Erich Stamberger:
>
> > Another interresting feature of Czech collation is:
> >
> > H < "CH" < I
> >
> > and:
> >
> > B < C < C + CARON < D .. < H < "CH" < I
> >
> > So what happens with "WHERE name like 'Czec%`" ?
>
> Our existing code fails because it generates WHERE name >= 'Czec' AND
> name < 'Czed'; it will therefore not find names beginning 'Czech'
> because those are in another part of the index, between 'Czeh' and
> 'Czei'. But WHERE name >= 'Cze' AND name < 'Czf' would work.

The Problem is: What tells us, that it is 'f' which sorts
after 'e' in that locale? In the "C" locale you can
simply add One to the character's code to get the next one,
since the numerical ordering of the encoding is identical to the
collation: 'e' + 1 = 'f' and 'e' < 'f'. This is *not* true
for *every* real-world language-encoding-pair in the world -
at least not for characters with codes above 127 and characters
with codes below 65.

In the example above we are in luck, although there are
additional characters between 'e' and 'f' in Czech sorting:

Collation => .. < 'e' = 'e + acute' = 'e + caron' < 'f'
Encoding => 101 < 233 < 236 > 102

To my knowledge the ISO C API doesn't provide an interface
to collation information (IAPITA!). There are no succ()
and pred() functions like in PASCAL for example.

And even if these functions could be emulated, I'm not
sure about possible "side effects". French for example,
has even more funny rules ("funny" from a programmer's point
of view): Accented characters which appear later in a string
are more important than accented characters which appear earlier.

IMHO, using the OS's locale support in databases asks
for trouble anyway:

Who guaratees that the strcolls/localedefs floating
around behave the same way?

What, if some kind soul of system admistrator updates the
OS and fixes a buggy locale definition file (maybe without
knowing)? The next UPDATE or INSERT coming along will
damage the indices of all databases using the affected locale.
Even a simple SELECT may yield strange results.

>
> Are there examples where this still doesn't work? (Funny sort rules
> for trigraphs would break it, I'm sure, unless we drop two characters
> instead of just one.)
>

I don't know if there are any locales, where removing/appending
"something" from/to a string can result in a higher/lower
collation weight: "xyzab < xyz" or "xyz > xyzab".

> Obviously we could still keep the last character in ASCII locale.
> That would be a good thing since it'd reduce the number of tuples
> scanned. Is there a portable way to determine whether it's safe to
> do so in other locales? (Some inquiry function about whether the sort
> ordering has any digraph or two-to-one rules might help, but I don't
> know if there is one.)
>

Even ASCII (7-bit) encoded *locales* may be in big trouble here:

gewi:~$ cat en.txt
1
2
?
?2
?A
?a
-A
-a
+
-
/
a
b
A
B

gewi:~$ export LANG="C"
gewi:~$ sort en.txt
+
-
-A
-a
/
1
2
?
?2
?A
?a
A
B
a
b

gewi:~$ export LANG="en_US"
gewi:~$ sort en.txt
-
?
/
+
1
?2
2
-A
?A
A
-a
?a
a
B
b

.. at least strings with punctuation characters will fail
in certain cases.

--
Erich (still thinking)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2000-09-03 22:16:33 Re: psql can crash the backend on login
Previous Message pgsql-bugs 2000-09-03 16:40:58 psql can crash the backend on login