Re: Re: LIKE gripes

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: lockhart(at)alumni(dot)caltech(dot)edu
Cc: t-ishii(at)sra(dot)co(dot)jp, Inoue(at)tpf(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Re: LIKE gripes
Date: 2000-08-11 08:13:47
Message-ID: 20000811171347P.t-ishii@sra.co.jp
Lists: pgsql-hackers

> To get the length I'm now just running through the output string looking
> for a zero value. This should be more efficient than reading the
> original string twice; it might be nice if the conversion routines
> (which now return nothing) returned the actual number of pg_wchars in
> the output.

Sounds reasonable. I'm going to enhance them as you suggested.
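
Something along these lines is what I have in mind (just a sketch, not
tested; the byte-packing loop stands in for the real per-encoding
conversion):

static int
pg_mb2wchar_with_len_counted(const unsigned char *from, pg_wchar *to, int len)
{
    int         cnt = 0;

    while (len > 0 && *from)
    {
        int         l = pg_mblen((const char *) from);
        pg_wchar    c = 0;
        int         i;

        /* placeholder: pack the bytes into one 32-bit pg_wchar;
         * the real routine dispatches to the encoding-specific code */
        for (i = 0; i < l; i++)
            c = (c << 8) | from[i];

        to[cnt++] = c;
        from += l;
        len -= l;
    }
    to[cnt] = 0;            /* keep the terminating zero */
    return cnt;             /* number of pg_wchars actually produced */
}

Then like()/regex could use the return value instead of rescanning the
output for the zero.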

> The original like() code allocates a pg_wchar array dimensioned by the
> number of bytes in the input string (which happens to be the absolute
> upper limit for the size of the 32-bit-encoded string). Worst case, this
> results in a 4-1 expansion of memory, and always requires a
> palloc()/pfree() for each call to the comparison routines.

Right.

There would be another approach that avoids using such extra memory
space. However, I am not sure it is worth implementing right now.
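
(For reference, the pattern in question looks roughly like this; I am
paraphrasing rather than quoting the source, and MBMatchText() is just
a stand-in for whichever pg_wchar comparison routine gets called:)

static bool
mblike_sketch(const unsigned char *s, int slen, const unsigned char *p, int plen)
{
    /* slen + 1 pg_wchars is the absolute upper bound for the output,
     * i.e. up to a 4-to-1 expansion in memory, on every call */
    pg_wchar   *ws = (pg_wchar *) palloc((slen + 1) * sizeof(pg_wchar));
    pg_wchar   *wp = (pg_wchar *) palloc((plen + 1) * sizeof(pg_wchar));
    bool        result;

    pg_mb2wchar_with_len((const char *) s, ws, slen);
    pg_mb2wchar_with_len((const char *) p, wp, plen);

    /* hypothetical stand-in for the actual wchar-based LIKE matcher */
    result = MBMatchText(ws, pg_wchar_strlen(ws),
                         wp, pg_wchar_strlen(wp));

    pfree(ws);
    pfree(wp);
    return result;
}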

> I think I have a solution for the current code; could someone test its
> behavior with MB enabled? It is now committed to the source tree; I know
> it compiles, but afaik am not equipped to test it :(

It passed the MB test, but fails the string test. Yes, I know it fails
because ILIKE for MB is not implemented (yet). I'm looking forward to
implementing the missing part. Is that OK with you, Thomas?
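
For the ILIKE part, the simplest approach I can think of is to fold
only the single-byte range before comparing, something like this
(again just a sketch, not necessarily what will go in):

#include <ctype.h>

static pg_wchar
wchar_tolower(pg_wchar c)
{
    /* tolower() is only meaningful for the single-byte (ASCII) range;
     * multibyte characters are left untouched */
    if (c <= 0x7F)
        return (pg_wchar) tolower((unsigned char) c);
    return c;
}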

> I am not planning on converting everything to UniCode for disk storage.

Glad to hear that.

> What I would *like* to do is the following:
>
> 1) support each encoding "natively", using Postgres' type system to
> distinguish between them. This would allow strings with the same
> encodings to be used without conversion, and would both minimize storage
> requirements *and* run-time conversion costs.
>
> 2) support conversions between encodings, again using Postgres' type
> system to suggest the appropriate conversion routines. This would allow
> strings with different but compatible encodings to be mixed, but
> requires internal conversions *only* if someone is mixing encodings
> inside their database.
>
> 3) one of the supported encodings might be Unicode, and if one chooses,
> that could be used for on-disk storage. Same with the other existing
> encodings.
>
> 4) this different approach to encoding support can coexist with the
> existing MB support since (1) - (3) is done without mention of existing
> MB internal features. So you can choose which scheme to use, and can
> test the new scheme without breaking the existing one.
>
> imho this comes closer to one of the important goals of maximizing
> performance for internal operations (since there is less internal string
> copying/conversion required), even at the expense of extra conversion
> cost when doing input/output (a good trade since *usually* there are
> lots of internal operations to a few i/o operations).
>
> Comments?

Please note that the existing MB implementation does not incur such an
extra conversion cost except in a few MB-aware functions (text_length
etc.), regex, LIKE, and the input/output stage. Also, MB stores native
encodings 'as is' on disk.
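
For example, when MB is enabled, text_length() only has to walk the
string counting characters instead of bytes, roughly like this sketch:

static int
mb_character_count(const unsigned char *s, int len)
{
    int         chars = 0;

    while (len > 0 && *s)
    {
        int         l = pg_mblen((const char *) s);     /* bytes in this character */

        s += l;
        len -= l;
        chars++;
    }
    return chars;
}

So the extra cost is a single scan of the string, not a copy or a
recoding.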

Anyway, it looks like MB would eventually be merged into, or
deprecated in favor of, your new implementation of multiple-encoding
support.

BTW, Thomas, do you have a plan to support collation functions?
--
Tatsuo Ishii
