Re: How to boost performance of queries containing pattern matching characters

From: Artur Zając <azajac(at)ang(dot)com(dot)pl>
To: <gnanam(at)zoniac(dot)com>, "'Richard Huxton'" <dev(at)archonet(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: How to boost performance of queries containing pattern matching characters
Date: 2011-02-14 07:38:48
Message-ID: 003801cbcc1a$37d2f7e0$a778e7a0$@ang.com.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

>How can we boost performance of queries containing pattern matching
>characters? In my case, we're using a percent sign (%) that matches any
string of zero or more characters.
>
> QUERY: DELETE FROM MYTABLE WHERE EMAIL ILIKE '%domain.com%'
>
> EMAIL column is VARCHAR(256).
>
> As it is clear from the above query, email is matched "partially and
case-insensitively", which my application requirement demands.
>
> In case, if it were a full match, I could easily define a functional
> INDEX on EMAIL column (lower(EMAIL)) and I could rewrite my DELETE where
criteria like lower(EMAIL) = 'someemail(at)domain(dot)com'.
>
> MYTABLE currently contains 2 million records and grows consistently.

I had almost the same problem.
To resolve it, I created my own text search parser (myftscfg) which divides
text in column into three letters parts, for example:

someemail(at)domain(dot)com is divided to som, ome,mee,eem,ema,mai,ail,il@,
l(at)d,@do,dom,oma,mai,ain,in.,n.c,.co,com

There should be also index on email column:

CREATE INDEX "email _fts" on mytable using gin
(to_tsvector('myftscfg'::regconfig, email))

Every query like email ilike '%domain.com%' should be rewrited to:

WHERE
to_tsvector('myftscfg',email) @@ to_tsquery('dom') AND
to_tsvector('myftscfg',email) @@ to_tsquery('oma') AND
to_tsvector('myftscfg',email) @@ to_tsquery('mai') AND
to_tsvector('myftscfg',email) @@ to_tsquery('ain') AND
to_tsvector('myftscfg',email) @@ to_tsquery('in.') AND
to_tsvector('myftscfg',email) @@ to_tsquery('n.c') AND
to_tsvector('myftscfg',email) @@ to_tsquery('.co') AND
to_tsvector('myftscfg',email) @@ to_tsquery('com') AND email ILIKE
'%domain.com%';

Index is reducing number of records and clause email ILIKE '%domain.com%' is
selecting only valid records.

I didn't found better solution.

-------------------------------------------
Artur Zajac

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Richard Huxton 2011-02-14 07:39:33 Re: How to boost performance of queries containing pattern matching characters
Previous Message Gnanakumar 2011-02-14 07:28:23 Re: How to boost performance of queries containing pattern matching characters