Optimizing Trigram searches in PG 9.1

From: Jonathan Bartlett <jonathan(dot)l(dot)bartlett(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Optimizing Trigram searches in PG 9.1
Date: 2011-09-22 16:40:46
Message-ID: CAHRTq6RKauc_-eUt7NGm7dmfCjbyPsSniOPeiCG3hJR4guHT1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I am working on a fuzzy search of a large dataset. Basically, it is a list
of all of the songs, albums, artists, movies, and celebrities exported from
Freebase. Anyway, I was hoping to get a fuzzy search that was nearly as
fast as the full-text search with the new nearest-neighbor GIST indexes,
but, while it is improved from 9.0, it is still taking some time. The table
has about 16 million rows, each with a "name" column that is usually 2-10
words.

My query using full-text search is this:
explain analyze select * from entities where
to_tsvector('unaccented_english', entities.name) @@
plainto_tsquery('unaccented_english', 'bon jovi');
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on entities (cost=42.64..1375.62 rows=340 width=1274)
(actual time=0.422..0.617 rows=109 loops=1)
Recheck Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text)
@@ '''bon'' & ''jovi'''::tsquery)
-> Bitmap Index Scan on entity_unaccented_name_gin_index
(cost=0.00..42.56 rows=340 width=0) (actual time=0.402..0.402 rows=109
loops=1)
Index Cond: (to_tsvector('unaccented_english'::regconfig,
(name)::text) @@ '''bon'' & ''jovi'''::tsquery)
Total runtime: 0.728 ms
(5 rows)

My new query using trigrams is this:
explain analyze select * from entities where name % 'bon jovi';
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on entities (cost=913.73..46585.14 rows=13615 width=1274)
(actual time=7769.380..7772.739 rows=326 loops=1)
Recheck Cond: ((name)::text % 'bon jovi'::text)
-> Bitmap Index Scan on tmp_entity_name_trgm_gist_idx
(cost=0.00..910.33 rows=13615 width=0) (actual time=7769.307..7769.307
rows=326 loops=1)
Index Cond: ((name)::text % 'bon jovi'::text)
Total runtime: 7773.008 ms

If I put a limit on it, it gets better, but is still pretty bad:
explain analyze select * from entities where name % 'bon jovi' limit 50;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..200.14 rows=50 width=1274) (actual time=1.246..1226.146
rows=50 loops=1)
-> Index Scan using tmp_entity_name_trgm_gist_idx on entities
(cost=0.00..54498.48 rows=13615 width=1274) (actual time=1.243..1226.016
rows=50 loops=1)
Index Cond: ((name)::text % 'bon jovi'::text)
Total runtime: 1226.261 ms
(4 rows)

And if I try to get the "best" matches, the performance goes completely down
the tubes, even with a limit:
explain analyze select * from entities where name % 'bon jovi' order by name
<-> 'bon jovi' limit 50;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..200.39 rows=50 width=1274) (actual
time=421.811..8058.877 rows=50 loops=1)
-> Index Scan using tmp_entity_name_trgm_gist_idx on entities
(cost=0.00..54566.55 rows=13615 width=1274) (actual time=421.808..8058.766
rows=50 loops=1)
Index Cond: ((name)::text % 'bon jovi'::text)
Order By: ((name)::text <-> 'bon jovi'::text)
Total runtime: 8060.760 ms

Anyway, this may just be a limitation of the trigram indexing, but my hope
was to get a fuzzy search that at least approached the performance of the
full-text searching. Am I missing something, or am I just bumping into the
limits? I also noticed that different strings get radically different
performance. Searching for "hello" drops the search time down to 310ms!
But searching for 'hello my friend' brings the search time to 9616ms!
explain analyze select * from entities where name % 'hello' order by name
<-> 'hello' limit 50;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..200.39 rows=50 width=1274) (actual time=5.056..309.492
rows=50 loops=1)
-> Index Scan using tmp_entity_name_trgm_gist_idx on entities
(cost=0.00..54566.55 rows=13615 width=1274) (actual time=5.053..309.393
rows=50 loops=1)
Index Cond: ((name)::text % 'hello'::text)
Order By: ((name)::text <-> 'hello'::text)
Total runtime: 309.637 ms

explain analyze select * from entities where name % 'hello my friend' order
by name <-> 'hello my friend' limit 50;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..200.39 rows=50 width=1274) (actual time=76.358..9616.066
rows=50 loops=1)
-> Index Scan using tmp_entity_name_trgm_gist_idx on entities
(cost=0.00..54566.55 rows=13615 width=1274) (actual time=76.356..9615.968
rows=50 loops=1)
Index Cond: ((name)::text % 'hello my friend'::text)
Order By: ((name)::text <-> 'hello my friend'::text)
Total runtime: 9616.203 ms

For reference, here is my table structure:
\d entities
Table "public.entities"
Column | Type |
Modifiers
----------------------+-----------------------------+-------------------------------------------------------
id | integer | not null default
nextval('entities_id_seq'::regclass)
name | character varying(255) |
disambiguation | character varying(255) |
description | text |
entity_basic_type | character varying(255) |
entity_extended_type | character varying(255) |
primary | boolean | default true
semantic_world_id | integer |
calc_completed | boolean | default true
source | text |
source_entity_id | integer |
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
data_import_id | integer |
validated | boolean | default true
weight | integer |
description_source | text |
description_url | text |
rating | text |
Indexes:
"entities_pkey" PRIMARY KEY, btree (id)
"entity_lower_name_idx" btree (lower(name::text) text_pattern_ops)
"entity_name_gin_index" gin (to_tsvector('english'::regconfig,
name::text))
"entity_unaccented_name_gin_index" gin
(to_tsvector('unaccented_english'::regconfig, name::text))
"index_entities_on_data_import_id" btree (data_import_id)
"index_entities_on_name" btree (name)
"index_entities_on_source" btree (source)
"tmp_entity_name_trgm_gist_idx" gist (name gist_trgm_ops)

Thanks for the help!

Jon

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Michael Viscuso 2011-09-22 22:55:47 Re: Query optimization using order by and limit
Previous Message Stephen Frost 2011-09-22 14:53:19 Re: Query optimization using order by and limit