Quick Links

Re: daitch_mokotoff module

From:	Dag Lem <dag(at)nimrod(dot)no>
To:	Paul Ramsey <pramsey(at)cleverelephant(dot)ca>
Cc:	Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: daitch_mokotoff module
Date:	2023-01-12 15:30:39
Message-ID:	ygeh6wvsrc0.fsf@sid.nimrod.no
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Paul Ramsey <pramsey(at)cleverelephant(dot)ca> writes:

> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem <dag(at)nimrod(dot)no> wrote:
>
>> I also improved on the documentation example (using Full Text Search).
>> AFAIK you can't make general queries like that using arrays, however in
>> any case I must admit that text arrays seem like more natural building
>> blocks than space delimited text here.
>
> This is a fun addition to fuzzystrmatch.

I'm glad to hear it! :-)

>
> While it's a little late in the game, I'll just put it out there:
> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
> how you feel about that.

I chose the name in order to follow the naming of the other functions in
fuzzystrmatch, which as far as I can tell are given the name which each
algorithm is known by.

Personally I don't think it's worth it to deviate from the naming of the
other functions just to avoid typing a few characters, and I certainly
don't think daitch_mokotoff is any harder to get right than
levenshtein_less_equal ;-)

So, if I were to decide, I wouldn't change the name of the function.
However I'm obviously not calling the shots on what goes into PostgreSQL
- perhaps someone else would like to weigh in on this?

>
> On the documentation, I found the leap directly into the tsquery
> example a bit too big. Maybe start with a very simple example,
>
> --
> dm=# SELECT daitch_mokotoff('Schwartzenegger'),
> daitch_mokotoff('Swartzenegger');
>
> daitch_mokotoff | daitch_mokotoff
> -----------------+-----------------
> {479465} | {479465}
> --
>
> Then transition into a more complex example that illustrates the GIN
> index technique you mention in the text, but do not show:
>
> --
> CREATE TABLE dm_gin (source text, dm text[]);
>
> INSERT INTO dm_gin (source) VALUES
> ('Swartzenegger'),
> ('John'),
> ('James'),
> ('Steinman'),
> ('Steinmetz');
>
> UPDATE dm_gin SET dm = daitch_mokotoff(source);
>
> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);
>
> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
> --

Sure, I can do that. You don't think this much example text will be
TL;DR?

>
> And only then go into the tsearch example. Incidentally, what does the
> tsearch approach provide that the simple GIN approach does not?

The example shows how to do a simultaneous match on first AND last
names, where the first and last names (any number of names) are stored
in the same indexed column, and the order of the names in the index and
the search term does not matter.

If you were to use the GIN "&&" operator, you would get a match if
either the first OR the last name matches. If you were to use the GIN
"@>" operator, you would *not* get a match if the search term contains
more soundex codes than the indexed name.

E.g. this yields a correct match:
SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson');

While this yields a false positive:
SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && (daitch_mokotoff('John') || daitch_mokotoff('Doe'));

And this yields a false negative:
SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> (daitch_mokotoff('John') || daitch_mokotoff('Jameson'));

This may explained better by simply showing the output of
soundex_tsvector and soundex_tsquery:

SELECT soundex_tsvector('John Yamson');
soundex_tsvector
----------------------------------
'160000':1 '164600':3 '460000':2

SELECT soundex_tsquery('John Jameson');
soundex_tsquery
---------------------------------------------------
( '160000' | '460000' ) & ( '164600' | '464600' )

> Ideally explain that briefly before launching into the example. With
> all the custom functions and so on it's a little involved, so maybe if
> there's not a huge win in using that approach drop it entirely?

I believe this functionality is quite useful, and that it's actually
what's called for in many situations. So, I'd rather not drop this
example.

>
> ATB,
> P
>

Best regards,

Dag Lem

In response to

Re: daitch_mokotoff module at 2023-01-11 20:40:31 from Paul Ramsey

Responses

Re: daitch_mokotoff module at 2023-01-12 15:52:17 from Paul Ramsey

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2023-01-12 15:30:44	Re: split TOAST support out of postgres.h
Previous Message	David G. Johnston	2023-01-12 15:21:57	Re: Named Operators