Wildcard searches & performance question

From: Grega Bremec <grega(dot)bremec(at)noviforum(dot)si>
To: pgsql-performance(at)postgresql(dot)org
Subject: Wildcard searches & performance question
Date: 2003-05-27 19:09:08
Message-ID: 20030527210908.A571@elbereth.noviforum.si
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

I have a database with the following layout:

searchterms=# \d+ searches_2002
Table "public.searches_2002"
Column | Type | Modifiers | Description
-----------+------------------------+-----------+-------------
srchdate | date | not null |
srchtime | time without time zone | not null |
client_ip | inet | not null |
srchquery | character varying(50) | not null |
fhvalue | smallint | |
Indexes: searches_2002_client_ip btree (client_ip),
searches_2002_srchdate btree (srchdate),
searches_2002_srchdatetime btree (srchdate, srchtime),
searches_2002_srchquery btree (srchquery),
searches_2002_srchquery_lcase btree (lower(srchquery)),
searches_2002_srchquery_withfh btree (srchquery, fhvalue),
searches_2002_srchtime btree (srchtime)

There are no uniqueness properties that would make it possible for this table
to have a primary key, as it is a list of searches performed on a search
engine and the users' behaviour is, well... umm, odd, to be mild. :)

Also, do note that this is a test table, so nevermind the excessive amount of
indexes - performance is not an issue here, I am still evaluating the need and
benefits of having various indexes on those columns.

The particular case I am interested is this: when executing queries involving
pattern searches using various operators on srchquery, none of the indexes are
used in certain cases, namely those LIKE and regexp filters that start with
a wildcard.

This makes perfect sense, because wildcard pattern searches that start with a
wildcard, can not really benefit from an index scan, because a sequential scan
is probably going to be faster: we are only going to benefit from scanning an
index in those special cases where the wildcard evaluates to a zero-length
string.

One example of a query plan:

searchterms=# explain select count(*)
from searches_2002
where srchquery like '%curriculum%';
QUERY PLAN
--------------------------------------------------------------------------
Aggregate (cost=4583.26..4583.26 rows=1 width=0)
-> Seq Scan on searches_2002 (cost=0.00..4583.26 rows=1 width=0)
Filter: (srchquery ~~ '%curriculum%'::text)

There is 211061 records in this test table, but the real-life tables would
contain a much much larger amount of data, more like 50+ million rows.

This promise of a hell on earth trying to optimize performance makes me wonder:
would there be a sensible way/reason for avoiding sequential scans on queries
that start with a wildcard, and would avoiding sequential scans even be
feasible in such cases? Or in other words, can I somehow optimize LIKE and
regexp queries that start with a wildcard?

TIA,
--
Grega Bremec
System Administration & Development Support
grega.bremec-at-noviforum.si
http://najdi.si/
http://www.noviforum.si/

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Andrew Sullivan 2003-05-27 19:37:00 Re: Wildcard searches & performance question
Previous Message Tom Lane 2003-05-23 20:34:48 Simplifying varchar and bpchar behavior