Re: Fuzzy substring searching with the pg_trgm extension

From:	Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fuzzy substring searching with the pg_trgm extension
Date:	2016-01-29 15:58:39
Message-ID:	56AB8C2F.2080609@postgrespro.ru
Lists:	pgsql-hackers

On 29.01.2016 18:39, Alvaro Herrera wrote:
> Teodor Sigaev wrote:
>>> The behavior of this function is surprising to me.
>>>
>>> select substring_similarity('dog' , 'hotdogpound') ;
>>>
>>> substring_similarity
>>> ----------------------
>>> 0.25
>>>
>> Substring search was designed to search for a similar word in a string:
>> contrib_regression=# select substring_similarity('dog' , 'hot dogpound') ;
>> substring_similarity
>> ----------------------
>> 0.75
>>
>> contrib_regression=# select substring_similarity('dog' , 'hot dog pound') ;
>> substring_similarity
>> ----------------------
>> 1
>
> Hmm, this behavior looks too much like magic to me. I mean, a substring
> is a substring -- why are we treating the space as a special character
> here?
>

I think I can rename this function to subword_similarity() and correct
the documentation.

The current behavior is designed to find the most similar word in a text.
For example, if we searched for just a substring (not a word), we would
get the following result:

select substring_similarity('dog', 'dogmatist');
substring_similarity
---------------------
1
(1 row)

But I think this is wrong. They are completely different words.

Maybe another function should be added for searching for a similar
substring (not a word) in a text?
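
To make the difference concrete, here is a rough Python sketch of the idea.
It is not the code from the patch. It assumes the usual pg_trgm padding (two
spaces before each word, one after) and simply tries every contiguous run of
trigrams in the text, keeping the best |shared| / |union| score. The function
names and the pad switch are hypothetical and exist only for illustration.

```python
import re


def trigrams(text, pad=True):
    """Return the ordered trigrams of every word in text.

    With pad=True each word gets two leading spaces and one trailing
    space (assumed pg_trgm-style padding); with pad=False word
    boundaries are ignored, which models a pure substring search.
    """
    result = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " " if pad else word
        result.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def best_extent_similarity(query, text, pad=True):
    """Best similarity between query and any contiguous trigram run of text."""
    q = set(trigrams(query, pad))
    t = trigrams(text, pad)
    best = 0.0
    for start in range(len(t)):
        window = set()
        for end in range(start, len(t)):
            window.add(t[end])
            # Set similarity of the query against the current run.
            best = max(best, len(q & window) / len(q | window))
    return best


print(best_extent_similarity("dog", "hotdogpound"))           # 0.25
print(best_extent_similarity("dog", "hot dogpound"))          # 0.75
print(best_extent_similarity("dog", "hot dog pound"))         # 1.0
print(best_extent_similarity("dog", "dogmatist"))             # 0.75
print(best_extent_similarity("dog", "dogmatist", pad=False))  # 1.0
```

With padding, 'dog' has four trigrams, and only those touching a word
boundary in the text can match the padded ones, so the space matters. Without
padding, 'dog' is the single trigram "dog" and matches 'dogmatist' perfectly,
which is the result I consider wrong.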

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

In response to

Re: Fuzzy substring searching with the pg_trgm extension at 2016-01-29 15:39:51 from Alvaro Herrera

Responses

Re: Fuzzy substring searching with the pg_trgm extension at 2016-02-01 17:12:03 from Artur Zakirov
