Re: Optimizing Trigram searches in PG 9.1

From: Dave Crooke <dcrooke(at)gmail(dot)com>
To: Jonathan Bartlett <jonathan(dot)l(dot)bartlett(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Optimizing Trigram searches in PG 9.1
Date: 2011-09-22 23:03:51
Message-ID: CALi4UpiPzUL=hMi9HdK7HA4+vWdihYtLxG6_uW2NedeBzyi74w@mail.gmail.com
Lists: pgsql-performance

Depending on your needs, you might consider putting the data into a dedicated
text search engine like Lucene and having it return the integer ids, which can
then be used for row lookups in PG.
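
For example (a minimal sketch, assuming Lucene hands back a batch of
matching integer ids; the id values here are hypothetical):

-- The Lucene query runs outside PG and yields, say, 101, 205, 317;
-- PG then only has to do cheap primary-key lookups:
SELECT *
FROM entities
WHERE id = ANY (ARRAY[101, 205, 317]);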

On Thu, Sep 22, 2011 at 11:40 AM, Jonathan Bartlett <jonathan(dot)l(dot)bartlett(at)gmail(dot)com> wrote:

> I am working on a fuzzy search of a large dataset. Basically, it is a list
> of all of the songs, albums, artists, movies, and celebrities exported from
> Freebase. I was hoping to get a fuzzy search that was nearly as fast as
> full-text search by using the new nearest-neighbor GiST indexes, but, while
> performance is improved over 9.0, it is still taking some time. The table
> has about 16 million rows, each with a "name" column that is usually 2-10
> words.
>
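> For reference, the trigram setup is roughly the following (a minimal
> sketch; the index name matches tmp_entity_name_trgm_gist_idx in the
> schema at the end):
>
> -- pg_trgm ships in contrib; 9.1 installs it as an extension
> CREATE EXTENSION pg_trgm;
>
> -- A GiST trigram index supports both the % operator and <-> ordering
> CREATE INDEX tmp_entity_name_trgm_gist_idx
>     ON entities USING gist (name gist_trgm_ops);
>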
> My query using full-text search is this:
> explain analyze select * from entities where
> to_tsvector('unaccented_english', entities.name) @@
> plainto_tsquery('unaccented_english', 'bon jovi');
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on entities (cost=42.64..1375.62 rows=340 width=1274) (actual time=0.422..0.617 rows=109 loops=1)
>   Recheck Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text) @@ '''bon'' & ''jovi'''::tsquery)
>   ->  Bitmap Index Scan on entity_unaccented_name_gin_index (cost=0.00..42.56 rows=340 width=0) (actual time=0.402..0.402 rows=109 loops=1)
>         Index Cond: (to_tsvector('unaccented_english'::regconfig, (name)::text) @@ '''bon'' & ''jovi'''::tsquery)
> Total runtime: 0.728 ms
> (5 rows)
>
> My new query using trigrams is this:
> explain analyze select * from entities where name % 'bon jovi';
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on entities (cost=913.73..46585.14 rows=13615 width=1274) (actual time=7769.380..7772.739 rows=326 loops=1)
>   Recheck Cond: ((name)::text % 'bon jovi'::text)
>   ->  Bitmap Index Scan on tmp_entity_name_trgm_gist_idx (cost=0.00..910.33 rows=13615 width=0) (actual time=7769.307..7769.307 rows=326 loops=1)
>         Index Cond: ((name)::text % 'bon jovi'::text)
> Total runtime: 7773.008 ms
>
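> As an aside on alternatives: a GIN trigram index can sometimes evaluate
> the plain % operator faster than GiST, though GIN cannot serve the <->
> ordering used further down. A minimal sketch, with a hypothetical index
> name:
>
> CREATE INDEX entity_name_trgm_gin_idx
>     ON entities USING gin (name gin_trgm_ops);
>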
> If I put a limit on it, it gets better, but is still pretty bad:
> explain analyze select * from entities where name % 'bon jovi' limit 50;
>
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..200.14 rows=50 width=1274) (actual time=1.246..1226.146 rows=50 loops=1)
>   ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities (cost=0.00..54498.48 rows=13615 width=1274) (actual time=1.243..1226.016 rows=50 loops=1)
>         Index Cond: ((name)::text % 'bon jovi'::text)
> Total runtime: 1226.261 ms
> (4 rows)
>
> And if I try to get the "best" matches, the performance goes completely
> down the tubes, even with a limit:
> explain analyze select * from entities where name % 'bon jovi' order by
> name <-> 'bon jovi' limit 50;
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..200.39 rows=50 width=1274) (actual time=421.811..8058.877 rows=50 loops=1)
>   ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities (cost=0.00..54566.55 rows=13615 width=1274) (actual time=421.808..8058.766 rows=50 loops=1)
>         Index Cond: ((name)::text % 'bon jovi'::text)
>         Order By: ((name)::text <-> 'bon jovi'::text)
> Total runtime: 8060.760 ms
>
> Anyway, this may just be a limitation of trigram indexing, but my hope
> was to get a fuzzy search that at least approached the performance of the
> full-text search. Am I missing something, or am I just bumping into the
> limits? I also noticed that different strings get radically different
> performance: searching for 'hello' drops the search time to 310 ms, but
> searching for 'hello my friend' pushes it to 9616 ms!
> explain analyze select * from entities where name % 'hello' order by name
> <-> 'hello' limit 50;
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..200.39 rows=50 width=1274) (actual time=5.056..309.492 rows=50 loops=1)
>   ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities (cost=0.00..54566.55 rows=13615 width=1274) (actual time=5.053..309.393 rows=50 loops=1)
>         Index Cond: ((name)::text % 'hello'::text)
>         Order By: ((name)::text <-> 'hello'::text)
> Total runtime: 309.637 ms
>
> explain analyze select * from entities where name % 'hello my friend' order
> by name <-> 'hello my friend' limit 50;
>
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..200.39 rows=50 width=1274) (actual time=76.358..9616.066 rows=50 loops=1)
>   ->  Index Scan using tmp_entity_name_trgm_gist_idx on entities (cost=0.00..54566.55 rows=13615 width=1274) (actual time=76.356..9615.968 rows=50 loops=1)
>         Index Cond: ((name)::text % 'hello my friend'::text)
>         Order By: ((name)::text <-> 'hello my friend'::text)
> Total runtime: 9616.203 ms
>
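> For what it's worth, the % operator filters on pg_trgm's similarity
> threshold (0.3 by default), so raising it shrinks the candidate set. A
> minimal sketch using pg_trgm's built-in functions:
>
> SELECT show_limit();    -- current threshold, 0.3 unless changed
> SELECT set_limit(0.5);  -- require closer matches before % returns true
>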
> For reference, here is my table structure:
> \d entities
> Table "public.entities"
> Column | Type | Modifiers
> ----------------------+-----------------------------+-------------------------------------------------------
> id | integer | not null default nextval('entities_id_seq'::regclass)
> name | character varying(255) |
> disambiguation | character varying(255) |
> description | text |
> entity_basic_type | character varying(255) |
> entity_extended_type | character varying(255) |
> primary | boolean | default true
> semantic_world_id | integer |
> calc_completed | boolean | default true
> source | text |
> source_entity_id | integer |
> created_at | timestamp without time zone |
> updated_at | timestamp without time zone |
> data_import_id | integer |
> validated | boolean | default true
> weight | integer |
> description_source | text |
> description_url | text |
> rating | text |
> Indexes:
> "entities_pkey" PRIMARY KEY, btree (id)
> "entity_lower_name_idx" btree (lower(name::text) text_pattern_ops)
> "entity_name_gin_index" gin (to_tsvector('english'::regconfig,
> name::text))
> "entity_unaccented_name_gin_index" gin
> (to_tsvector('unaccented_english'::regconfig, name::text))
> "index_entities_on_data_import_id" btree (data_import_id)
> "index_entities_on_name" btree (name)
> "index_entities_on_source" btree (source)
> "tmp_entity_name_trgm_gist_idx" gist (name gist_trgm_ops)
>
> Thanks for the help!
>
> Jon
>
