Re: daitch_mokotoff module

From: Dag Lem <dag(at)nimrod(dot)no>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: daitch_mokotoff module
Date: 2023-01-02 21:00:34
Message-ID: ygetu18tzwt.fsf@sid.nimrod.no
Lists: pgsql-hackers

Sorry about the previous unfinished email - I don't know which key
combination I managed to hit there.

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

>
> So, yes, I'm proposing that we return those as array elements and that
> @> is used to match them.
>

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.
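
To illustrate the difference, here is a minimal sketch in Python that
mimics the two operators with sets. The codes are the ones quoted
further down in this thread and are hard-coded rather than computed;
the helper names overlaps() and contains() are made up for the example
and are not part of the patch.

```python
# Soundex codes as quoted in this thread (hard-coded, not computed here).
codes = {
    "Vera": {"790000"},
    "Veras": {"794000"},
    "Borja": {"790000", "794000"},
}


def overlaps(a, b):
    """Rough analogue of the array operator &&: at least one common element."""
    return not a.isdisjoint(b)


def contains(a, b):
    """Rough analogue of the array operator @>: a holds every element of b."""
    return a >= b


# Vera and Borja share one code, so "overlaps" finds the match ...
vera_overlaps_borja = overlaps(codes["Vera"], codes["Borja"])
# ... while "contains" fails, since Vera lacks Borja's second code.
vera_contains_borja = contains(codes["Vera"], codes["Borja"])
# Containment only holds in the other direction.
borja_contains_vera = contains(codes["Borja"], codes["Vera"])
# Vera and Veras have no code in common at all.
vera_overlaps_veras = overlaps(codes["Vera"], codes["Veras"])
```

With "contains", the result depends on which side has more codes, which
is why "overlaps" looks like the better fit for matching two names.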

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays. In any
case, I must admit that text arrays seem like more natural building
blocks than space-delimited text here.

[...]

>> BTW Vera 790000 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes. But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 790000 and 794000. (Any Spanish speaker will readily tell you that
> neither Vera nor Veras is similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borja's codes. So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives to a person called Veras. And, as a Spanish speaker, that
> would make a lot of sense to me.)

It is what it is - we can't call it Daitch-Mokotoff Soundex while
implementing something else. Having said that, one can always pre- or
postprocess to tweak the results.
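
As one possible example of such postprocessing, the sketch below (in
Python, outside the database) links names that share no code directly
but both match a third name. The codes are the ones quoted above,
hard-coded; the function names are hypothetical and nothing like this
is included in the patch.

```python
from itertools import combinations

# Soundex codes as quoted in this thread (hard-coded, not computed here).
codes = {
    "Vera": {"790000"},
    "Veras": {"794000"},
    "Borja": {"790000", "794000"},
}


def direct_matches(codes):
    """Pairs of names that share at least one soundex code."""
    return {
        frozenset((a, b))
        for a, b in combinations(codes, 2)
        if codes[a] & codes[b]
    }


def indirect_matches(codes):
    """Pairs with no common code that both match some third name."""
    direct = direct_matches(codes)
    result = set()
    for a, b in combinations(codes, 2):
        pair = frozenset((a, b))
        if pair in direct:
            continue  # already a plain match
        for c in codes:
            if c in pair:
                continue
            if frozenset((a, c)) in direct and frozenset((b, c)) in direct:
                result.add(pair)
                break
    return result


direct = direct_matches(codes)
indirect = indirect_matches(codes)
```

Here Vera and Veras end up as an indirect match via Borja. Whether such
an extra step is desirable depends on how many additional false
positives one is willing to accept.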

Daitch-Mokotoff Soundex is known to produce false positives, but that is
in many cases not a problem.

Even though it's clearly tuned for Jewish names, the soundex algorithm
seems to work just fine for European names (we use it to match mostly
Norwegian names).

Best regards

Dag Lem

Attachment: v11-daitch_mokotoff.patch (text/x-patch, 38.1 KB)
