Re: Patch: pg_trgm: gin index scan performance for similarity search

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Fornaroli Christophe <cfornaro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch: pg_trgm: gin index scan performance for similarity search
Date: 2015-12-25 10:10:44
Message-ID: 567D1624.6090702@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Good catch, committed.

Fornaroli Christophe wrote:
> Hi,
>
> I think that we can improve the gin index scan performance for similarity search
> defined in the pg_trgm extension. The similarity function is (for the default
> case where DIVUNION is defined in the code):
>
> count / (len1 + len2 - count) >= trgm_limit
>
> where
> len1 is the number of unique trigrams for the first string,
> len2 is the same number for the second string,
> count is the number of common trigrams between both strings,
> trgm_limit is a user specfied limit in [0, 1].
>
> The code used to determine if a tuple may match the query string is:
>
> case SimilarityStrategyNumber:
> /* Count the matches */
> ntrue = 0;
> for (i = 0; i < nkeys; i++)
> {
> if (check[i] != GIN_FALSE)
> ntrue++;
> }
> #ifdef DIVUNION
> res = (nkeys == ntrue) ? GIN_MAYBE : (((((float4) ntrue) /
> ((float4) (nkeys - ntrue))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
> #else
> res = (nkeys == 0) ? GIN_FALSE : (((((float4) ntrue) / ((float4)
> nkeys)) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
> #endif
>
> where
> ntrue is the number of common trigrams in both strings,
> nkeys is the number of trigrams in the search string.
>
> This code uses this upper bound for the similarity: ntrue / (nkeys - ntrue). But
> if there is ntrue trigrams in common, we know that the indexed string is at
> least ntrue trigrams long. We can then use a more aggressive upper bound: ntrue
> / (ntrue + nkeys - ntrue) or ntrue / nkeys. Attached is a patch that changes this.
>
> Here are some performance gains with this test case:
>
> create table foo as select
> substring(md5(random()::text) for random() * 5) || '123' as bar
> from generate_series(1,1000000);
>
> create index on foo using gin (bar gin_trgm_ops);
>
> patched:
>
> test=# explain analyze select count(*) from foo where bar % 'abc123';
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual
> time=807.434..807.435 rows=1 loops=1)
> -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=109.893..787.261 rows=54746 loops=1)
> Recheck Cond: (bar % 'abc123'::text)
> Rows Removed by Index Recheck: 55125
> Heap Blocks: exact=4514
> -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
> width=0) (actual time=108.456..108.456 rows=109871 loops=1)
> Index Cond: (bar % 'abc123'::text)
> Planning time: 0.353 ms
> Execution time: 807.593 ms
> (9 rows)
>
> test=# explain analyze select count(*) from foo where bar % 'abcdef';
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual time=4.829..4.830
> rows=1 loops=1)
> -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=3.512..4.794 rows=5 loops=1)
> Recheck Cond: (bar % 'abcdef'::text)
> Rows Removed by Index Recheck: 137
> Heap Blocks: exact=139
> -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
> width=0) (actual time=3.355..3.355 rows=142 loops=1)
> Index Cond: (bar % 'abcdef'::text)
> Planning time: 0.363 ms
> Execution time: 5.061 ms
> (9 rows)
>
>
> master:
>
> test=# explain analyze select count(*) from foo where bar % 'abc123';
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual
> time=6416.554..6416.554 rows=1 loops=1)
> -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=484.359..6389.819 rows=54746 loops=1)
> Recheck Cond: (bar % 'abc123'::text)
> Rows Removed by Index Recheck: 945250
> Heap Blocks: exact=4514
> -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
> width=0) (actual time=482.677..482.677 rows=999996 loops=1)
> Index Cond: (bar % 'abc123'::text)
> Planning time: 0.359 ms
> Execution time: 6416.945 ms
> (9 rows)
>
> test=# explain analyze select count(*) from foo where bar % 'abcdef';
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=2511.14..2511.15 rows=1 width=0) (actual time=30.678..30.679
> rows=1 loops=1)
> -> Bitmap Heap Scan on foo (cost=99.75..2508.64 rows=1000 width=0) (actual
> time=9.020..30.643 rows=5 loops=1)
> Recheck Cond: (bar % 'abcdef'::text)
> Rows Removed by Index Recheck: 2789
> Heap Blocks: exact=2110
> -> Bitmap Index Scan on foo_bar_idx (cost=0.00..99.50 rows=1000
> width=0) (actual time=7.696..7.696 rows=2794 loops=1)
> Index Cond: (bar % 'abcdef'::text)
> Planning time: 0.254 ms
> Execution time: 30.809 ms
> (9 rows)
>
>
> Cheers,
>
> Christophe
>
>
>

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fornaroli Christophe 2015-12-25 10:19:48 Re: Patch: pg_trgm: gin index scan performance for similarity search
Previous Message Etsuro Fujita 2015-12-25 10:00:35 Re: Optimization for updating foreign tables in Postgres FDW