Re: Fastest Index/Algorithm to find similar sentences

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc:	Janek Sendrowski <janek12(at)web(dot)de>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: Fastest Index/Algorithm to find similar sentences
Date:	2013-08-20 23:18:09
Message-ID:	CAHyXU0zKSRpFVTd3x9uKNf-nK-Dr96+Ot=7_0TiR47_-q0oTRg@mail.gmail.com
Lists:	pgsql-general

On Fri, Aug 2, 2013 at 10:25 AM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Janek Sendrowski <janek12(at)web(dot)de> wrote:
>
>> I also tried pg_trgm module, which works with tri-grams, but it's
>> also very slow with 100.000+ rows.
>
> Hmm. I found the pg_trgm module very fast for name searches with
> millions of rows *as long as I used KNN-GiST techniques*. Were you
> careful to do so? Check out the "Index Support" section of this
> page:
>
> http://www.postgresql.org/docs/current/static/pgtrgm.html
>
> While I have not tested this technique with a column containing
> sentences, I would expect it to work well. As a quick
> confirmation, I imported the text form of War and Peace into a
> table, with one row per *line* (because that was easier than
> parsing sentence boundaries for a quick test). That was over
> 65,000 rows.

+1 to this. pg_trgm is black magic. Search time (when using an index)
depends mostly on the number of trigrams in the search string versus the
average number of trigrams in the database.

merlin
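
As a rough illustration of why trigram counts matter here, the sketch
below approximates how pg_trgm breaks a string into trigrams and scores
similarity as shared trigrams over total distinct trigrams. It is not
pg_trgm's actual code: the function names are made up for this example,
and it assumes ASCII-only input, lowercasing, and padding each word with
two leading spaces and one trailing space, as the pg_trgm documentation
describes.

```python
import re


def trigrams(text):
    """Approximate the trigram set pg_trgm would extract (ASCII-only sketch)."""
    result = set()
    # Lowercase, then treat every run of letters/digits as one word;
    # everything else acts as a word separator.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Assumed padding: two spaces before the word, one after.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by the number of distinct trigrams in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    query = "similar sentences"
    line = "Fastest Index/Algorithm to find similar sentences"
    print(len(trigrams(query)), len(trigrams(line)))
    print(round(similarity(query, line), 3))
```

A longer search string produces more trigrams, and each one has to be
looked up and matched, which is roughly where the dependence on trigram
counts comes from.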

In response to

Re: Fastest Index/Algorithm to find similar sentences at 2013-08-02 15:25:12 from Kevin Grittner
