Re: POSIX regex performance bug in 7.3 Vs. 7.2

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: wade <wade(at)wavefire(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: POSIX regex performance bug in 7.3 Vs. 7.2
Date: 2003-02-04 16:46:31
Message-ID: 14971.1044377191@sss.pgh.pa.us
Lists: pgsql-hackers

wade <wade(at)wavefire(dot)com> writes:
> I redid my trials with the same data set on 7.2.3 --with-multibyte and I
> get the same brutal performance hit, so it is definitely a
> multibyte-specific problem.
>
> There are only about 1000 words that appear more than once (2 or 3 times)
> in 27k rows.

Right, so the caching of compiled regexps that regexp.c does is of no
help, and any change in its behavior in 7.3 wouldn't have made any
difference anyway. I leapt to a conclusion after reviewing the CVS
logs for pertinent changes, but it was the wrong conclusion. The true
problem is that MULTIBYTE is now forced on, and that causes some
loops in the regexp compiler to change from 256 to 65536 iterations.

I believe if you change NC in src/include/regex/utils.h from its new
value of 65536 back to 256, performance will go back where it was.
This will *not* do if you run any multibyte character sets --- but
as long as the database is all ASCII or ISO-8859-whatever, it should
be a safe hack that will let you use 7.3.*.
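
To illustrate why the loop bound matters so much here, the following is a toy
model only, not the actual regexp compiler: it assumes each compilation walks
a table with one slot per possible character code. The names toy_compile_cost,
toy_total_cost, NC_SINGLE_BYTE and NC_MULTIBYTE are hypothetical and do not
appear in the PostgreSQL sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the old and new values of NC. */
#define NC_SINGLE_BYTE 256
#define NC_MULTIBYTE   65536

/*
 * Toy model of one regexp compilation: allocate a per-pattern table with
 * one slot per possible character code, clear it, and report how many
 * slots were touched.  Returns 0 if the allocation fails.
 */
static size_t
toy_compile_cost(size_t nc)
{
    unsigned char *table;
    size_t      i;
    size_t      touched = 0;

    assert(nc > 0);
    table = malloc(nc);
    if (table == NULL)
        return 0;

    for (i = 0; i < nc; i++)
    {
        table[i] = 0;
        touched++;
    }

    free(table);
    return touched;
}

/*
 * Total work when every row carries a distinct pattern, so a cache of
 * compiled regexps never gets a hit and each row pays the full cost.
 */
static size_t
toy_total_cost(size_t npatterns, size_t nc)
{
    size_t      total = 0;
    size_t      p;

    for (p = 0; p < npatterns; p++)
        total += toy_compile_cost(nc);
    return total;
}
```

Under this assumption, toy_total_cost(n, NC_MULTIBYTE) is 256 times
toy_total_cost(n, NC_SINGLE_BYTE) for any n, which is the kind of blowup
that a workload of roughly 27k mostly distinct patterns would expose.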

Rather than trying to band-aid a solution like this in the main sources,
I think I shall go investigate Spencer's new regexp code in Tcl, which
reputedly is designed for wider-than-8-bit chars from the get-go. We've
had it on the TODO list for a long time to assimilate that code; it's
probably time to make it happen.

regards, tom lane
