Quick Links

Re: Full Text Search combined with Fuzzy

From:	Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To:	Nicolas Paris <niparisco(at)gmail(dot)com>, Oleg Bartunov <obartunov(at)gmail(dot)com>
Cc:	Postgres General <pgsql-general(at)postgresql(dot)org>
Subject:	Re: Full Text Search combined with Fuzzy
Date:	2017-03-03 13:08:19
Message-ID:	93e265be-81d7-a65f-4ad4-5eb816fd75a9@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On 03.03.2017 15:49, Nicolas Paris wrote:
>
> Hi Oleg,
>
> Thanks. I thought pgtrgm was not able to index my long texts because of
> limitation of 8191 bytes per index row for btree.
>
> Then I found out it is possible to use pgtrgm over a GIN/GIST index.
> My final use case is phrase mining in texts.
>
> I want my application returns texts that contains approximatly the user
> entry:
>
> Eg: user search "Hello Word"
> a text containing "blah blah blah hello world blah blah blah" would be
> returned.
>
> Test:
> postgres=# CREATE table test_trgm (texts text);
> CREATE TABLE
> postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
> CREATE INDEX
> postgres=# SET enable_seqscan = OFF;
> SET
> postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah blah');
> INSERT 0 1
> postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah blah');
> INSERT 0 1
> postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
> texts | similarity
> -------------------------------------------+------------
> blah blah blah hello world blah blah blah | 0.473684
> blah blah blah hello word blah blah blah | 0.6875
> (2 rows)
>
> postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
> QUERY PLAN
> -----------------------------------------------------------------------------------
> Bitmap Heap Scan on test_trgm (cost=52.01..56.03 rows=1 width=32)
> Recheck Cond: (texts % 'hello word'::text)
> -> Bitmap Index Scan on test_trgm_texts_idx (cost=0.00..52.01 rows=1 width=0)
> Index Cond: (texts % 'hello word'::text)
> (4 rows)
>
> Conclusion: If I d'say 0.4 is my threshold, would this methodology meet
> my requirements ?
>
> Thanks for the help !
>

Hello,

If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For
example:

postgres=# SELECT texts, word_similarity('hello word', texts) FROM
test_trgm WHERE 'hello word' <% texts;
texts | word_similarity
-------------------------------------------+-----------------
blah blah blah hello world blah blah blah | 0.818182
blah blah blah hello word blah blah blah | 1
(2 rows)

1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

In response to

Re: Full Text Search combined with Fuzzy at 2017-03-03 12:49:05 from Nicolas Paris

Responses

Re: Full Text Search combined with Fuzzy at 2017-03-03 13:17:26 from Nicolas Paris

Browse pgsql-general by date

	From	Date	Subject
Next Message	Nicolas Paris	2017-03-03 13:17:26	Re: Full Text Search combined with Fuzzy
Previous Message	Geoff Winkless	2017-03-03 12:58:02	Re: ERROR: functions in index expression must be marked IMMUTABLE