Fwd: pg_trgm word_similarity inconsistencies or bug

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Jan Przemysław Wójcik <jan(dot)przemyslaw(dot)wojcik(at)gmail(dot)com>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>, Cristiano Coelho <cristianocca(at)hotmail(dot)com>, pgsql-bugs(at)postgresql(dot)org, François CHAHUNEAU <Francois(dot)CHAHUNEAU(at)numen(dot)fr>, Thierry BOUDIERE <Thierry(dot)BOUDIERE(at)numen(dot)fr>
Subject: Fwd: pg_trgm word_similarity inconsistencies or bug
Date: 2017-11-06 11:34:24
Message-ID: CAPpHfdsDtQJ+pYfapH7GMt9n4RLByKMzHEBWcPL99C_g+rkUKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hi!

I'd like to forward a feedback from our customer who uses word_similarity()
function.
François finds current behavior of word_similarity() to be useful. Thus, I
think we should preserve it. But documentation correction is needed and
option for alternative behavior would be useful too.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

---------- Forwarded message ----------
From: François CHAHUNEAU <Francois(dot)CHAHUNEAU(at)numen(dot)fr>
Date: Wed, Nov 1, 2017 at 1:04 AM
Subject: RE: [BUGS] pg_trgm word_similarity inconsistencies or bug
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Thierry BOUDIERE <Thierry(dot)BOUDIERE(at)numen(dot)fr>, "foli(at)numen(dot)mg" <
foli(at)numen(dot)mg>

Hello Alexander,

We agree that the current pg_trgm documentation does not correctly reflect
the de facto behavior of word_similarity(), and that something has to be
changed. But to us, it is more a documentation problem than anything else.

What is computed is still « substring_similarity » as was initially
specified between us, but it is influenced by a strong word boundary bias
caused by the way trigrams are padded at word boundaries. This bias was
noticed by early reviewers and you explained that this motivated the name
switch to « word_similarity ». As you will remember, at the time we
discovered this, we were suprised because we considerd this as a slight
misnomer. Indeed, what is currently described in the 9.6 pg_trgm
documentation is inaccurate (although seemingly consistent with this new
name) and has to be amended.

Now, word_similarity() has been out for more than a year and, of course, it
is preferable to avoid any breaking changes… In our case, we consider the
name « unfortunate » and the explanation buggy, not the function itself.

As you may remember from the initial discussion, some other users stressed
the importance to be able to matchsub strings. We tend to agree with what
Jeff Janes wrote in this discussion :

The reason I like the option of not treating word boundaries as

special in this case is that often in scientific vocabulary, and in

catalog part numbers, people are pretty inconsistent about whether

they included spaces. "HEK 293", "HEK293", and "HEK-293" could be all

the same thing. So I like to strip out spaces and punctuation on both

sides of operator. Of course I can't do that if there are invisible

un-removable spaces on the substring side.

But, It doesn't sound like I am going to win that debate. Given that,

I don't think we need a different name for the function. I'm fine with

explaining the word-boundary subtlety in the documentation, and

keeping the function name itself simple.

Now, considering your proposal :

As far as we are concerned, we use <% and %> everyday for efficient fuzzy
matching on large databases. Our typical usage scenario is matching noisy
OCRized text strings against reference databases.

*> 1) Define GUC variable which specifies whether word_similarity() should
force extent boundaries to be at word boundaries,*

Ok for us,* iff* default behavior remains the same as now, for backward
compatibility reasons. We could take advantage, *in some cases*, of the new
« word rounded » behavior controlled by the GUC variable, but this would
not cover all scenarios currently in use.

> 2*) Document both cases of word_similarity() behavior.*

This is clearly needed anyway.

Best regards,

*François CHAHUNEAU*
Directeur des technologies

NUMEN DIGITAL| 24, rue Marc Seguin
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
75018
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
Paris
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
France
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>*
| www.numen.fr
<https://numen.letsignit.com/r/0/991c6b92-d8fe-4afa-95f5-7b74d0322fd9>*
Tel +33 1 40 37 95 03 <+33%201%2040%2037%2095%2003> | Mob +33 6 07 85 21 79
<+33%206%2007%2085%2021%2079> | Fax +33 1 40 37 94 94
<+33%201%2040%2037%2094%2094>
<https://numen.letsignit.com/r/15/57dd0ced-dea8-441a-a066-68bf7cedbecd>
<https://numen.letsignit.com/r/3/9be1fd6e-57d8-4963-bcc7-03151b263433> Pensez
vert, n’imprimez que nécessaire. Les informations contenues dans le présent
e-mail sont exclusivement adressées au(x) destinataire(s) de ce message et
peuvent contenir des informations confidentielles, protégées par un secret
professionnel. L’utilisation de ces informations par d’autres personnes que
le(s) destinataire(s) est strictement interdite. Si vous n’êtes pas
destinataire de ce message, la publication, la reproduction, la diffusion
et /ou la distribution de ces informations auprès de tiers n’est pas
autorisée. Si vous avez reçu cet e-mail par erreur, veuillez nous en
informer immédiatement, détruire l'email, ses copies et documents joints et
le supprimer.

*De :* Alexander Korotkov [mailto:a(dot)korotkov(at)postgrespro(dot)ru]
*Envoyé :* mardi 31 octobre 2017 16:18
*À :* Thierry BOUDIERE <Thierry(dot)BOUDIERE(at)numen(dot)fr>; François CHAHUNEAU <
Francois(dot)CHAHUNEAU(at)numen(dot)fr>
*Objet :* Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

Dear, Thierry and François!

PostgreSQL users found inconsistency between documentation and
implementation of word_similarity().

Possible solution proposed by the reporter is to alter the implementation.

But it's important for me that your interests are not affected but
potential further change of implementation of word_similarity().

Could you please share your opinion on changes proposed by Jan in the
pgsql-bugs mailing list?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2017-11-06 14:43:38 Re: BUG #14889: explain analyze is taking much more time than actual execution
Previous Message David Rowley 2017-11-06 08:08:59 Re: BUG #14889: explain analyze is taking much more time than actual execution

Browse pgsql-hackers by date

  From Date Subject
Next Message Raúl Marín Rodríguez 2017-11-06 11:41:06 Re: pow support for pgbench
Previous Message Alexander Korotkov 2017-11-06 11:14:51 Re: Display number of heap accesses for index scans