Re: pg_trgm version 1.2

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_trgm version 1.2
Date: 2015-06-29 19:34:36
Message-ID: CAHyXU0z1Ht8r276OMrZZ01_g3j+C6eey+1jFFkzwbV62v9bRcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 29, 2015 at 7:23 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache
>>
>> V1.2: Time: 2.839 ms --- after repeated execution to warm the cache
>
> Wow! I'm going to test this. I have some data sets for which trigram
> searching isn't really practical...if the search string touches
> trigrams with a lot of duplication the algorithm can have trouble
> beating brute force searches.
>
> trigram searching is important: it's the only way currently to search
> string encoded structures for partial strings quickly.

I ran your patch against stock 9.4 and am happy to confirm massive
speedups of pg_trgm; results of 90% reduction in runtime are common.
Also, with the new changes it's hard to get the indexed search to
significantly underperform brute force searching which is a huge
improvement vs the stock behavior, something that made me very wary of
using these kinds of searches in the past.

datatable: 'test2'
rows: ~ 2 million
heap size: 3.3GB (includes several unrelated fields)
index size: 1GB
9.4: stock
9.5: patched

match 50% rows, brute force seq scan
9.4: 11.5s
9.5: 9.1s

match 50% rows, indexed (time is quite variable with 9.4 giving > 40 sec times)
9.4: 21.0s
9.5: 11.8s

match 1% rows, indexed (>90% time reduction!)
9.4: .566s
9.5: .046s

match .1% rows, one selective one non-selective search term, selective
term first
9.4: .563s
9.5: .028s

match .1% rows, one selective one non-selective search term, selective term last
9.4: 1.014s
9.5: 0.093s

very nice! Recently, I examined pg_tgrm for an attribute searching
system -- it failed due to response time variability and lack of tools
to control that. Were your patch in place, I would have passed it. I
had a 'real world' data set though. With this, pg_trgm is basically
outperforming SOLR search engine for all cases we're interested in
whereas before low selectivity cases where having all kinds of
trouble.

merlin

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-06-29 19:48:40 Re: Rework the way multixact truncations work
Previous Message Josh Berkus 2015-06-29 17:40:56 Re: Support for N synchronous standby servers - take 2