Re: daitch_mokotoff module

From: Paul Ramsey <pramsey(at)cleverelephant(dot)ca>
To: Dag Lem <dag(at)nimrod(dot)no>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: daitch_mokotoff module
Date: 2023-01-12 15:52:17
Message-ID: A2C12929-5B04-4202-95AB-A2970FB5016A@cleverelephant.ca
Lists: pgsql-hackers

> On Jan 12, 2023, at 7:30 AM, Dag Lem <dag(at)nimrod(dot)no> wrote:
>
> Paul Ramsey <pramsey(at)cleverelephant(dot)ca> writes:
>
>> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem <dag(at)nimrod(dot)no> wrote:
>>
>>> I also improved on the documentation example (using Full Text Search).
>>> AFAIK you can't make general queries like that using arrays, however in
>>> any case I must admit that text arrays seem like more natural building
>>> blocks than space delimited text here.
>>
>> This is a fun addition to fuzzystrmatch.
>
> I'm glad to hear it! :-)
>
>>
>> While it's a little late in the game, I'll just put it out there:
>> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
>> how you feel about that.
>
> I chose the name in order to follow the naming of the other functions in
> fuzzystrmatch, which as far as I can tell are given the name which each
> algorithm is known by.
>
> Personally I don't think it's worth it to deviate from the naming of the
> other functions just to avoid typing a few characters, and I certainly
> don't think daitch_mokotoff is any harder to get right than
> levenshtein_less_equal ;-)

Good points :)

>
>>
>> On the documentation, I found the leap directly into the tsquery
>> example a bit too big. Maybe start with a very simple example,
>>
>> --
>> dm=# SELECT daitch_mokotoff('Schwartzenegger'),
>>             daitch_mokotoff('Swartzenegger');
>>
>>  daitch_mokotoff | daitch_mokotoff
>> -----------------+-----------------
>>  {479465}        | {479465}
>> --
>>
>> Then transition into a more complex example that illustrates the GIN
>> index technique you mention in the text, but do not show:
>>
>> --
>> CREATE TABLE dm_gin (source text, dm text[]);
>>
>> INSERT INTO dm_gin (source) VALUES
>> ('Swartzenegger'),
>> ('John'),
>> ('James'),
>> ('Steinman'),
>> ('Steinmetz');
>>
>> UPDATE dm_gin SET dm = daitch_mokotoff(source);
>>
>> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);
>>
>> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
>> --
>
> Sure, I can do that. You don't think this much example text will be
> TL;DR?

I can only speak for myself, but examples are the meat of learning from documentation, so as long as they come with enough explanatory context to be legible, they're worth having, IMO.

>
>>
>> And only then go into the tsearch example. Incidentally, what does the
>> tsearch approach provide that the simple GIN approach does not?
>
> The example shows how to do a simultaneous match on first AND last
> names, where the first and last names (any number of names) are stored
> in the same indexed column, and the order of the names in the index and
> the search term does not matter.
>
> If you were to use the GIN "&&" operator, you would get a match if
> either the first OR the last name matches. If you were to use the GIN
> "@>" operator, you would *not* get a match if the search term contains
> more soundex codes than the indexed name.
>
> E.g. this yields a correct match:
> SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson');
>
> While this yields a false positive:
> SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && (daitch_mokotoff('John') || daitch_mokotoff('Doe'));
>
> And this yields a false negative:
> SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> (daitch_mokotoff('John') || daitch_mokotoff('Jameson'));
>
> This may be explained better by simply showing the output of
> soundex_tsvector and soundex_tsquery:
>
> SELECT soundex_tsvector('John Yamson');
>          soundex_tsvector
> ----------------------------------
>  '160000':1 '164600':3 '460000':2
>
> SELECT soundex_tsquery('John Jameson');
>                   soundex_tsquery
> ---------------------------------------------------
>  ( '160000' | '460000' ) & ( '164600' | '464600' )
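
The three cases described above can be reproduced without the custom soundex_tsvector and soundex_tsquery functions by writing the codes out as literals. This is only a sketch under stated assumptions: the codes for 'John', 'Yamson' and 'Jameson' are copied from the output shown above, while '300000' for 'Doe' is an assumed value used purely for illustration.

```sql
-- Codes taken from the soundex_tsvector / soundex_tsquery output above:
--   John    -> 160000, 460000
--   Yamson  -> 164600
--   Jameson -> 164600, 464600
-- '300000' for Doe is an ASSUMED code, used only for illustration.

-- "&&" (overlap) is true if ANY code is shared, so 'John Yamson'
-- matches 'John Doe' on the first name alone (false positive).
SELECT ARRAY['160000','460000','164600']
    && ARRAY['160000','460000','300000'] AS overlap_match;

-- "@>" (contains) requires EVERY code on the right-hand side, so the
-- alternative code 464600 for 'Jameson' blocks the match (false negative).
SELECT ARRAY['160000','460000','164600']
    @> ARRAY['160000','460000','164600','464600'] AS contains_match;

-- The tsquery asks for (any John code) AND (any Jameson code),
-- which 'John Yamson' satisfies.
SELECT '160000 460000 164600'::tsvector
    @@ '(160000 | 460000) & (164600 | 464600)'::tsquery AS tsquery_match;

-- The same tsquery rejects 'John Doe', since no last-name code matches.
SELECT '160000 460000 300000'::tsvector
    @@ '(160000 | 460000) & (164600 | 464600)'::tsquery AS tsquery_match_doe;
```

Under these assumptions the four queries should return true, false, true and false respectively, which is the AND-of-ORs behaviour that the array operators alone cannot express.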
>
>> Ideally explain that briefly before launching into the example. With
>> all the custom functions and so on it's a little involved, so maybe if
>> there's not a huge win in using that approach drop it entirely?
>
> I believe this functionality is quite useful, and that it's actually
> what's called for in many situations. So, I'd rather not drop this
> example.

Sounds good

P

>
>>
>> ATB,
>> P
>>
>
> Best regards,
>
> Dag Lem
