Re: Unicode combining characters

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: ZeugswetterA(at)spardat(dot)at
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgman(at)candle(dot)pha(dot)pa(dot)us, phede-ml(at)islande(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode combining characters
Date: 2001-10-04 02:16:42
Message-ID: 20011004111642R.t-ishii@sra.co.jp
Lists: pgsql-hackers pgsql-patches

> Ok. I ran the modified test (now the iteration is reduced to 100000 in
> liketest()). As you can see, there's huge difference. MB seems up to
> ~8 times slower:-< There seems some problems existing in the
> implementation. Considering REGEX is not so slow, maybe we should
> employ the same design as REGEX. i.e. using wide characters, not
> multibyte streams...
>
> MB+LIKE
> Total runtime: 1321.58 msec
> Total runtime: 1718.03 msec
> Total runtime: 2519.97 msec
> Total runtime: 4187.05 msec
> Total runtime: 7629.24 msec
> Total runtime: 14456.45 msec
> Total runtime: 17320.14 msec
> Total runtime: 17323.65 msec
> Total runtime: 17321.51 msec
>
> noMB+LIKE
> Total runtime: 964.90 msec
> Total runtime: 993.09 msec
> Total runtime: 1057.40 msec
> Total runtime: 1192.68 msec
> Total runtime: 1494.59 msec
> Total runtime: 2078.75 msec
> Total runtime: 2328.77 msec
> Total runtime: 2326.38 msec
> Total runtime: 2330.53 msec

I did some trials with a wide character implementation and saw
virtually no improvement. My guess is that the logic employed in LIKE
is too simple to hide the overhead of the conversion between multibyte
and wide characters. The reason why REGEX with MB is not so slow would
be the complexity of its logic, I think. As you can see in my previous
postings, the $1 ~ $2 operation (this is logically the same as
LIKE '%a%') is, for example, almost 80 times slower than LIKE (remember
that liketest() loops 10 times more than regextest()).

So I decided to use a completely different approach. LIKE now has two
matching engines: one for single byte encodings (MatchText()) and the
other for multibyte ones (MBMatchText()). MatchText() is identical to
the non-MB version, so there is virtually no performance penalty for
single byte encodings. MBMatchText() handles multibyte encodings and
is identical to the one used in 7.1.

Here is the MB case result with SQL_ASCII encoding.

Total runtime: 901.69 msec
Total runtime: 939.08 msec
Total runtime: 993.60 msec
Total runtime: 1148.18 msec
Total runtime: 1434.92 msec
Total runtime: 2024.59 msec
Total runtime: 2288.50 msec
Total runtime: 2290.53 msec
Total runtime: 2316.00 msec

To accomplish this, I moved MatchText etc. to a separate file, and
like.c now includes it *twice* (a technique similar to the one used in
regexec()). This makes like.o a little bit larger, but I believe the
optimization is worth it.
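
For illustration only, here is a minimal single-file sketch of the idea.
It is not the actual like.c code: a macro stands in for the separately
included file, and the names sb_match_text, mb_match_text, like_match,
utf8_char_len and the max_char_len parameter are all hypothetical. It
assumes UTF-8 as the multibyte encoding and handles only '%' and '_'
(no escape character).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character at s,
 * clamped to the number of bytes still available. */
static size_t utf8_char_len(const char *s, size_t avail)
{
    unsigned char c = (unsigned char) *s;
    size_t len = 1;

    if ((c & 0xE0) == 0xC0) len = 2;
    else if ((c & 0xF0) == 0xE0) len = 3;
    else if ((c & 0xF8) == 0xF0) len = 4;
    return len <= avail ? len : avail;
}

/* Single byte encodings: every character is exactly one byte. */
#define SB_CHAR_LEN(s, avail) ((size_t) 1)

/* The matcher body is written once.  Expanding it with a different
 * CHARLEN plays the role of including the same file twice. */
#define DEFINE_LIKE_MATCHER(NAME, CHARLEN)                                  \
static int NAME(const char *t, size_t tlen, const char *p, size_t plen)     \
{                                                                           \
    while (plen > 0) {                                                      \
        if (*p == '%') {                                                    \
            while (plen > 0 && *p == '%') { p++; plen--; }                  \
            if (plen == 0) return 1;                                        \
            for (;;) {                                                      \
                size_t step;                                                \
                if (NAME(t, tlen, p, plen)) return 1;                       \
                if (tlen == 0) return 0;                                    \
                step = CHARLEN(t, tlen);                                    \
                t += step; tlen -= step;                                    \
            }                                                               \
        }                                                                   \
        if (tlen == 0) return 0;                                            \
        {                                                                   \
            size_t tl = CHARLEN(t, tlen);                                   \
            size_t pl = CHARLEN(p, plen);                                   \
            if (*p != '_' && (tl != pl || memcmp(t, p, tl) != 0))           \
                return 0;                                                   \
            t += tl; tlen -= tl; p += pl; plen -= pl;                       \
        }                                                                   \
    }                                                                       \
    return tlen == 0;                                                       \
}

/* "Include twice": one engine per character-length rule. */
DEFINE_LIKE_MATCHER(sb_match_text, SB_CHAR_LEN)
DEFINE_LIKE_MATCHER(mb_match_text, utf8_char_len)

/* Pick the engine once per call.  max_char_len is assumed to be the
 * maximum bytes per character of the database encoding. */
static int like_match(const char *t, size_t tlen,
                      const char *p, size_t plen, int max_char_len)
{
    return max_char_len > 1 ? mb_match_text(t, tlen, p, plen)
                            : sb_match_text(t, tlen, p, plen);
}
```

With the 4-byte text "a\xC3\xA9" "c" (an 'a', a two-byte UTF-8
character, and a 'c') and the pattern "a_c", the single byte engine in
this sketch does not match, because '_' consumes only one byte, while
the multibyte engine does. The single byte engine never pays for the
character length lookup, which is the point of keeping the two engines
separate.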
--
Tatsuo Ishii
