Re: Re: LIKE gripes

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: lockhart(at)alumni(dot)caltech(dot)edu
Cc: Inoue(at)tpf(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Re: LIKE gripes
Date: 2000-08-09 12:45:13
Message-ID: 20000809214513M.t-ishii@sra.co.jp
Lists: pgsql-hackers

> > Where has MULTIBYTE Stuff in like.c gone ?

I didn't know that :-)

> Uh, I was wondering where it was in the first place! Will fix it asap...
>
> There was some string copying stuff in a middle layer of the like()
> code, but I had thought that it was there only to get a null-terminated
> string. When I rewrote the code to eliminate the need for null
> termination (by using the length attribute of the text data type) then
> the need for copying went away. Or so I thought :(
>
> The other piece to the puzzle is that the lowest-level like() support
> routine traversed the strings using the increment operator, and so I
> didn't understand that there was any MB support in there. I now see that
> *all* of these strings get stuffed into unsigned int arrays during
> copying; I had (sort of) understood some of the encoding schemes (most
> use a combination of one to three byte sequences for each character) and
> didn't realize that this normalization was being done on the fly.
>
> So, this answers some questions I have related to implementing character
> sets:
>
> 1) For each character set, we would need to provide operators for "next
> character" and for boolean comparisons for each character set. Why don't
> we have those now? Answer: because everything is getting promoted to a
> 32-bit internal encoding every time a comparison or traversal is
> required.

MB has something similar to the "next character" function, called
pg_encoding_mblen. It tells you the length in bytes of the MB word
pointed to, so that you can move forward to the next MB word, etc.
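
For example, something along these lines (a rough, untested sketch
against the backend's mb/pg_wchar.h interface; count_mb_chars is just
a made-up name for illustration):

    /*
     * Sketch only: walk a multibyte string one MB word at a time
     * using pg_mblen() (backend-internal, from mb/pg_wchar.h).
     */
    #include "postgres.h"
    #include "mb/pg_wchar.h"

    static int
    count_mb_chars(const unsigned char *s, int len)
    {
        int     n = 0;

        while (len > 0)
        {
            int     l = pg_mblen(s);    /* bytes in the MB word at *s */

            s += l;                     /* advance to the next MB word */
            len -= l;
            n++;
        }
        return n;
    }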

> 2) For each character set, we would need to provide conversion functions
> to other "compatible" character sets, or to a character "superset". Why
> don't we have those conversion functions? Answer: we do! There is an
> internal 32-bit encoding within which all comparisons are done.

Right.
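
Roughly speaking, the old code did something like the following before
comparing (a hand-waving sketch only; mb_equal is a made-up name, the
fixed-size buffers are for illustration, and error handling is omitted):

    /*
     * Sketch: compare two MB strings by first widening them to the
     * internal 32-bit pg_wchar form via pg_mb2wchar() (mb/pg_wchar.h).
     */
    #include "postgres.h"
    #include "mb/pg_wchar.h"

    static bool
    mb_equal(const unsigned char *a, const unsigned char *b)
    {
        pg_wchar    wa[256];        /* toy buffer sizes */
        pg_wchar    wb[256];
        int         i;

        pg_mb2wchar(a, wa);         /* widen; output is zero-terminated */
        pg_mb2wchar(b, wb);

        for (i = 0; wa[i] != 0 && wb[i] != 0; i++)
        {
            if (wa[i] != wb[i])
                return false;
        }
        return wa[i] == wb[i];
    }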

> Anyway, I think it will be pretty easy to put the MB stuff back in, by
> #ifdef'ing some string copying inside each of the routines (such as
> namelike()). The underlying routine no longer requires a null-terminated
> string (using explicit lengths instead) so I'll generate those lengths
> in the same place unless they are already provided by the char->int MB
> support code.

I have not taken a look at your new like code, but I guess you could use

pg_mbstrlen(const unsigned char *mbstr)

It tells you the number of MB words in mbstr (note, however, that
mbstr needs to be null-terminated).
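
That is, something like this (untested; show_lengths is just an
illustrative name):

    /*
     * strlen() counts bytes, pg_mbstrlen() counts MB words, so with
     * EUC_JP, say, a two-byte kanji counts once.
     */
    #include <string.h>
    #include "postgres.h"
    #include "mb/pg_wchar.h"

    static void
    show_lengths(const unsigned char *mbstr)    /* must be null-terminated */
    {
        elog(NOTICE, "bytes = %d, MB words = %d",
             (int) strlen((const char *) mbstr),
             pg_mbstrlen(mbstr));
    }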

> In the future, I'd like to see us use alternate encodings as-is, or as a
> common set like UniCode (16 bits wide afaik) rather than having to do
> this widening to 32 bits on the fly. Then, each supported character set
> can be efficiently manipulated internally, and only converted to another
> encoding when mixing with another character set.

If you are planning to convert everything to Unicode or whatever
before storing it on disk, I'd like to object to that idea. It's not
only a waste of disk space, it will also bring serious performance
degradation. For example, each ISO 8859 byte occupies 2 bytes after
being converted to Unicode. I don't think that doubling of disk space
consumption is acceptable.
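
Just to put rough numbers on it (back-of-envelope arithmetic only,
not a measurement):

    /*
     * ISO 8859 stores one byte per character; the same text widened
     * to 2-byte Unicode doubles it.
     */
    #include <stdio.h>

    int
    main(void)
    {
        long    nchars = 100L * 1024 * 1024;    /* say, 100M characters */
        long    iso8859_bytes = nchars * 1;     /* 1 byte per character */
        long    unicode_bytes = nchars * 2;     /* 2 bytes per character */

        printf("ISO 8859: %ld MB, 2-byte Unicode: %ld MB\n",
               iso8859_bytes / (1024 * 1024),
               unicode_bytes / (1024 * 1024));
        return 0;
    }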
--
Tatsuo Ishii
