Re: Fuzzy substring searching with the pg_trgm extension

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fuzzy substring searching with the pg_trgm extension
Date: 2016-01-15 21:07:39
Message-ID: CAMkU=1zynKQfkR-J2_Uq8Lzp0uho8i+LEdFwGt77CzK_tNtN7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 18, 2015 at 11:43 AM, Artur Zakirov
<a(dot)zakirov(at)postgrespro(dot)ru> wrote:
> Hello.
>
> PostgreSQL has a contrib module named pg_trgm. It is used to the fuzzy text
> search. It provides some functions and operators for determining the
> similarity of the given texts using trigram matching.
>
> At the moment, in pg_trgm both the similarity function and the % operator
> match two strings expecting that they are similar entirely. But they give
> bad results if we want to find documents by a query which is substring of a
> document.
>
> For this purpose we need a new operator which enables to find strings that
> have a fragment most similar to the given string. This patch adds some
> functions and operator:
> - function substring_similarity - Returns a number that indicates how
> similar the first string to the most similar substring of the second string.
> The range of the result is zero (indicating that the two strings are
> completely dissimilar) to one (indicating that the first string is identical
> to substring of the second substring).

The behavior of this function is surprising to me.

select substring_similarity('dog' , 'hotdogpound') ;

substring_similarity
----------------------
0.25

The most similar substring of the second string would be 'dog', which
has a similarity of 1.0 if it were an isolated string.

I think that this function should refrain from calculating the
space-padded trigrams at the beginning and end of its first argument.
Since the first argument is expected to be evaluated as a substring, I
wouldn't take it for granted that the beginning and end of the string
as given are at the beginning and end of a 'word', which is what is
currently being enforced.

Or perhaps the 2nd argument needs to have ghost trigrams computed at
each start and stop when comparing that substring to the full
argument.

If we prefer the current behavior, then I think the explanation needs
to be changed.

Also, should we have a function which indicates the position in the
2nd string at which the most similar match to the 1st argument occurs?

select substring_similarity_pos('dog' , 'hotdogpound') ;

answering: 4

When the long string is much longer, it can be hard to figure out
where the best match was (or the first such match, if there is a
series of ties).

Finally, should there be operators for returning the distance,
analogous to the <-> operator? It is awkward to use an operator for
the boolean and a function to get the distance. Especially when you
use %> and so need swap the order of arguments between the operator
and the function.

We could call them <<-> and <->> , where the first corresponds to <%
and the second to %>

Cheers,

Jeff

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2016-01-15 21:59:02 Re: GIN pending list clean up exposure to SQL
Previous Message Andres Freund 2016-01-15 21:02:13 Re: checkpointer continuous flushing