Re: Re: LIKE gripes

From: Thomas Lockhart <lockhart(at)alumni(dot)caltech(dot)edu>
To: Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Re: LIKE gripes
Date: 2000-08-08 15:19:05
Message-ID: 399024E9.6CB300F6@alumni.caltech.edu
Lists: pgsql-hackers

> Where has MULTIBYTE Stuff in like.c gone ?

Uh, I was wondering where it was in the first place! Will fix it asap...

There was some string copying stuff in a middle layer of the like()
code, but I had thought that it was there only to get a null-terminated
string. When I rewrote the code to eliminate the need for null
termination (by using the length attribute of the text data type) then
the need for copying went away. Or so I thought :(
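
To illustrate what matching on explicit lengths looks like, here is a
minimal sketch. It is not the code in like.c: the name like_match_len is
hypothetical, there is no escape handling, and it walks the buffers one
byte at a time, which is exactly the single-byte assumption that breaks
multibyte text.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the routine in like.c).
 * Match text t (tlen bytes) against LIKE pattern p (plen bytes).
 * '%' matches any run of bytes, '_' matches exactly one byte.
 * Neither buffer needs a trailing '\0'; only the lengths bound the scan.
 * Returns 1 on match, 0 otherwise.
 */
int
like_match_len(const char *t, size_t tlen, const char *p, size_t plen)
{
    assert(t != NULL && p != NULL);

    while (plen > 0)
    {
        if (*p == '%')
        {
            /* Collapse a run of consecutive '%' into one. */
            while (plen > 0 && *p == '%')
            {
                p++;
                plen--;
            }
            if (plen == 0)
                return 1;       /* trailing '%' matches whatever is left */

            /* Try the rest of the pattern against every suffix of t. */
            for (;;)
            {
                if (like_match_len(t, tlen, p, plen))
                    return 1;
                if (tlen == 0)
                    return 0;
                t++;
                tlen--;
            }
        }

        if (tlen == 0)
            return 0;           /* pattern wants a byte, text is used up */
        if (*p != '_' && *p != *t)
            return 0;           /* literal mismatch */

        t++;
        tlen--;
        p++;
        plen--;
    }

    /* Pattern is used up; it matches only if the text is too. */
    return tlen == 0;
}
```

Calling like_match_len("abcXYZ", 3, "a%c", 3) looks only at the first
three bytes of the text, so no copy is needed just to add a terminator.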

The other piece to the puzzle is that the lowest-level like() support
routine traversed the strings using the increment operator, and so I
didn't understand that there was any MB support in there. I now see that
*all* of these strings get stuffed into unsigned int arrays during
copying; I had (sort of) understood some of the encoding schemes (most
use a combination of one to three byte sequences for each character) and
didn't realize that this normalization was being done on the fly.
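
A rough sketch of that kind of on-the-fly widening follows. It assumes
UTF-8 input limited to one- to three-byte sequences; the names wide_char
and widen_utf8 are made up for illustration and are not the backend's MB
routines, which handle several encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the internal wide character type. */
typedef unsigned int wide_char;

/* True if b is a UTF-8 continuation byte (10xxxxxx). */
static int
is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Decode len bytes of UTF-8 (1- to 3-byte sequences only) into out,
 * one array element per character. out must have room for len elements.
 * Malformed or truncated sequences are passed through one byte at a time.
 * Returns the number of characters written.
 */
size_t
widen_utf8(const unsigned char *in, size_t len, wide_char *out)
{
    size_t i = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (i < len)
    {
        unsigned char c = in[i];

        if ((c & 0xE0) == 0xC0 && i + 1 < len && is_cont(in[i + 1]))
        {
            /* two-byte sequence: 110xxxxx 10xxxxxx */
            out[n++] = ((wide_char) (c & 0x1F) << 6)
                     | (wide_char) (in[i + 1] & 0x3F);
            i += 2;
        }
        else if ((c & 0xF0) == 0xE0 && i + 2 < len &&
                 is_cont(in[i + 1]) && is_cont(in[i + 2]))
        {
            /* three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx */
            out[n++] = ((wide_char) (c & 0x0F) << 12)
                     | ((wide_char) (in[i + 1] & 0x3F) << 6)
                     | (wide_char) (in[i + 2] & 0x3F);
            i += 3;
        }
        else
        {
            /* single byte, or something this sketch does not decode */
            out[n++] = c;
            i += 1;
        }
    }
    return n;
}
```

Once the text sits in such an array, "next character" really is just
the increment operator, which is why the low-level routine can look
encoding-unaware.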

So, this answers some questions I have related to implementing character
sets:

1) For each character set, we would need to provide a "next character"
operator and boolean comparison operators. Why don't we have those now?
Answer: because everything is getting promoted to a 32-bit internal
encoding every time a comparison or traversal is required.

2) For each character set, we would need to provide conversion functions
to other "compatible" character sets, or to a character "superset". Why
don't we have those conversion functions? Answer: we do! There is an
internal 32-bit encoding within which all comparisons are done.

Anyway, I think it will be pretty easy to put the MB stuff back in, by
#ifdef'ing some string copying inside each of the routines (such as
namelike()). The underlying routine no longer requires a null-terminated
string (using explicit lengths instead) so I'll generate those lengths
in the same place unless they are already provided by the char->int MB
support code.
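
The shape of that change might look roughly like the sketch below. All
names here (like_char, match_chars, widen, wrapper_like) are
hypothetical, the matcher is a placeholder that only understands '_',
the widening assumes one- and two-byte UTF-8, and MULTIBYTE is defined
inline only so the sketch compiles that path; normally the build would
set it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MULTIBYTE 1     /* assumption: normally supplied by the build */

#ifdef MULTIBYTE
typedef unsigned int like_char;     /* hypothetical wide element type */
#else
typedef unsigned char like_char;
#endif

/*
 * Placeholder for the low-level matcher: literal comparison in which
 * '_' matches any one character. Lengths are counted in characters.
 */
int
match_chars(const like_char *t, size_t tlen,
            const like_char *p, size_t plen)
{
    size_t i;

    if (tlen != plen)
        return 0;
    for (i = 0; i < tlen; i++)
        if (p[i] != '_' && p[i] != t[i])
            return 0;
    return 1;
}

#ifdef MULTIBYTE
/* Widen 1- and 2-byte UTF-8; other bytes pass through unchanged. */
static size_t
widen(const unsigned char *in, size_t len, like_char *out)
{
    size_t i = 0;
    size_t n = 0;

    while (i < len)
    {
        if ((in[i] & 0xE0) == 0xC0 && i + 1 < len)
        {
            out[n++] = ((like_char) (in[i] & 0x1F) << 6)
                     | (like_char) (in[i + 1] & 0x3F);
            i += 2;
        }
        else
            out[n++] = in[i++];
    }
    return n;
}
#endif

/*
 * Hypothetical wrapper in the spirit of namelike().
 * Returns 1 or 0, or -1 if memory could not be allocated.
 */
int
wrapper_like(const unsigned char *t, size_t tbytes,
             const unsigned char *p, size_t pbytes)
{
    assert(t != NULL && p != NULL);
#ifdef MULTIBYTE
    {
        /* +1 avoids a zero-size allocation for empty input */
        like_char *wt = malloc((tbytes + 1) * sizeof(like_char));
        like_char *wp = malloc((pbytes + 1) * sizeof(like_char));
        int        result = -1;

        if (wt != NULL && wp != NULL)
        {
            /* The widening step also yields the character counts. */
            size_t tlen = widen(t, tbytes, wt);
            size_t plen = widen(p, pbytes, wp);

            result = match_chars(wt, tlen, wp, plen);
        }
        free(wt);
        free(wp);
        return result;
    }
#else
    /* Single-byte build: byte counts are character counts, no copy. */
    return match_chars(t, tbytes, p, pbytes);
#endif
}
```

With the MB path compiled in, a two-byte character followed by 'b'
matches the pattern "_b", because '_' consumes one character rather
than one byte. In the single-byte path the same input would be three
elements long and would not match.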

In the future, I'd like to see us use alternate encodings as-is, or a
common set like Unicode (16 bits wide afaik), rather than having to do
this widening to 32 bits on the fly. Then each supported character set
can be manipulated efficiently internally, and only converted to another
encoding when mixing with another character set.

Any and all advice welcome and accepted (though "keep your hands off the
MB code!" seems a bit too late ;)

Sorry for the shake-up...

- Thomas
