Re: [GENERAL] indexed regex select optimisation missing?

From: "Ross J(dot) Reedstrom" <reedstrm(at)wallace(dot)ece(dot)rice(dot)edu>
To: Stuart Woolford <stuartw(at)newmail(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] indexed regex select optimisation missing?
Date: 1999-11-04 22:06:21
Message-ID: 19991104160621.B11924@wallace.ece.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Ah, your description just tripped a memory for me from the hackers list:

The behavior you describe has to do with the implementation of using an
index for regex matching, in the presence of the USE_LOCALE configuration
option.

Internally, the condition: WHERE word~'^alongword' is converted in the
parser(!) to:

WHERE word >= 'alongword' AND word < 'alongword\377'

since the index needs inequalities to be used, not matches. Now, the
problem is the hack of tacking an octal \377 on the string to create
the lexagraphically 'just bigger' value assumes ASCI sort order. If
USE_LOCALE is defined, this is dropped, since we don't have a good fix
yet, and slow correct behavior is better than fast, incorrect behavior.

So, you have two options: if you don't need locale support, recompile
without it. Otherwise, hand code your anchored matches as the pair of
conditionals above Hmm, is there syntax for adding an arbitrary value to
a string constant in the SQL? I suppose you could use: word < 'alongwore',
i.e. hand increment the last character, so it's larger than any match.

Your point is correct, the developers are aware of it as a theoretical
problem, at least. Always helps to hear a real world case, though. I
believe it's on the TODO list as is, otherwise, pester Bruce. ;-)

Reviewing my email logs from June, most of the work on this has to do with
people who needs locales, and potentially multibyte character sets. Tom
Lane is of the opinion that this particular optimization needs to be moved
out of the parser, and deeper into the planner or optimizer/rewriter,
so a good fix may be some ways out.

Ross

On Fri, Nov 05, 1999 at 10:12:06AM +1300, Stuart Woolford wrote:
>
> My point is that, while the index (in 6.5.1 and 6.5.2, anyway) is used to locate
> the start of the scan, the system is then index-scanning the *whole* rest of the
> table (which takes minutes for my 1.6 million entry table if it is from near
> the start), as opposed to using a better 'stop term' to stop scanning once the
> regex will no longer be able to match (ie: the static front of the regex is no
> longer matching), so the ordered scan is only being half utilised, this makes a
> MASSIVE difference in performance.
>
> For example, say one of the words in the table is 'alongword', and there is
> also 'alongwords', but no other words with the root of 'alongword'
>

[...]

>
> If I do a 'select key from inv_word_i where word~'^alongword' it uses the
> index to find 'alongword', then does an index scan of the *whole* rest of the
> table check all the rest of the entries for regex matching, so it takes a long
> time, and returns the two entries detailed above, it will take almost as long
> as the previous query.
>
> What it should do is stop as soon as the leftmost part of the regex match no
> longer matches 'alongword' because, as it is scanning in indexed order, a match
> is no longer possible. The query will then run at nearly the speed of the first

--
Ross J. Reedstrom, Ph.D., <reedstrm(at)rice(dot)edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Stuart Woolford 1999-11-04 23:09:19 Re: [GENERAL] indexed regex select optimisation missing?
Previous Message Stuart Woolford 1999-11-04 21:12:06 Re: [GENERAL] indexed regex select optimisation missing?

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 1999-11-04 22:22:42 New version of psql
Previous Message Bruce Momjian 1999-11-04 21:55:56 Re: [HACKERS] psql Week 4.142857