Re: Fastest Index/Algorithm to find similar sentences

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Janek Sendrowski <janek12(at)web(dot)de>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Fastest Index/Algorithm to find similar sentences
Date: 2013-08-20 23:18:09
Message-ID: CAHyXU0zKSRpFVTd3x9uKNf-nK-Dr96+Ot=7_0TiR47_-q0oTRg@mail.gmail.com
Lists: pgsql-general

On Fri, Aug 2, 2013 at 10:25 AM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Janek Sendrowski <janek12(at)web(dot)de> wrote:
>
>> I also tried pg_trgm module, which works with tri-grams, but it's
>> also very slow with 100.000+ rows.
>
> Hmm. I found the pg_trgm module very fast for name searches with
> millions of rows *as long as I used KNN-GiST techniques*. Were you
> careful to do so? Check out the "Index Support" section of this
> page:
>
> http://www.postgresql.org/docs/current/static/pgtrgm.html
>
> While I have not tested this technique with a column containing
> sentences, I would expect it to work well. As a quick
> confirmation, I imported the text form of War and Peace into a
> table, with one row per *line* (because that was easier than
> parsing sentence boundaries for a quick test). That was over
> 65,000 rows.

+1 to this. pg_trgm is black magic. Search time (when using the index)
depends mostly on the number of trigrams in the search string versus
the average number of trigrams in the database.

merlin
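
To make the "number of trigrams" point concrete, here is a rough Python
sketch of how a string breaks into trigrams and how two trigram sets can
be compared. It follows the rules described in the pg_trgm documentation
(lowercase, split into words, pad each word with two leading spaces and
one trailing space, compare sets), but it is only an approximation and
not the extension's actual code. The function names are made up for this
illustration, and it assumes ASCII-only input.

```python
import re


def trigrams(text):
    """Approximate the trigram set pg_trgm would extract from text.

    Assumption: ASCII input only. Characters outside [a-z0-9] are
    treated as word separators.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Each word gets two spaces in front and one space behind.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(len(trigrams("I found the pg_trgm module very fast")))
    print(similarity("word", "two words"))
```

A longer search string yields more trigrams, and each one is something
the index has to be probed for, which is why a full sentence costs more
to search than a short name.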
