From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Merlin Moncure <mmoncure(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: strange behavior of pg_trgm's similarity function |
Date: | 2013-10-10 18:52:42 |
Message-ID: | CAHGQGwFpjL9JfQL_vyc_nybXjvJaor5VaHfZNR2jesTEE=CUaQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Oct 10, 2013 at 11:00 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Oct 10, 2013 at 7:12 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> On 10.10.2013 15:03, Fujii Masao wrote:
>>>
>>> Hi,
>>>
>>> The behavior of pg_trgm's similarity function seems strange. Is this
>>> intentional?
>>>
>>> I was thinking that the following three calls of the similarity function
>>> return
>>> the same number because the second argument is just the three characters
>>> contained in the first argument in every calls.
>>>
>>> =# SELECT similarity('12345', '123');
>>> =# SELECT similarity('12345', '234');
>>> =# SELECT similarity('12345', '345');
>>>
>>> But that's not true. Each returns the different number.
>>>
>>> =# SELECT similarity('12345', '123');
>>> similarity
>>> ------------
>>> 0.428571
>>> (1 row)
>>>
>>> =# SELECT similarity('12345', '234');
>>> similarity
>>> ------------
>>> 0.111111
>>> (1 row)
>>>
>>> =# SELECT similarity('12345', '345');
>>> similarity
>>> ------------
>>> 0.25
>>> (1 row)
>>>
>>> This happens because, for example, similarity('12345', '123') returns
>>> the similarity number of '**12345*' and '**123*' (* means the blank
>>> character),
>>> NOT '12345' and '123'. IOW, two and one blank characters are added into
>>> the heading and tailing of each argument, respectively. I wonder why
>>> pg_trgm's similarity function works in this way. We should change this
>>> so that no blank characters are added into the arguments?
>>
>>
>> Well, you could also argue that "111111" and "222222" are quite similar,
>> even though pg_trgm's similarity will not think so. It comes down to the
>> definition of similarity, and how well that definition matches your
>> intuition.
>>
>> FWIW, it feels right to me that a match in the beginning of a word is worth
>> more than one in the middle of a string. -1 on changing that.
Okay, understood.
> I'm not so sure that the assumption that leading trigrams should
> effectively weight > 3x is a good one to build into the library.
> However, the behavior is clearly documented and can't be changed. I
> think you'd need to improvise an alternate set of "trigram ops" if you
> wanted to rig an alternate matching behavior.
Yeah, this makes sense.
Regards,
--
Fujii Masao
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2013-10-10 18:53:21 | Re: Auto-tuning work_mem and maintenance_work_mem |
Previous Message | Magnus Hagander | 2013-10-10 18:49:17 | Re: Auto-tuning work_mem and maintenance_work_mem |