Re: daitch_mokotoff module

From: Dag Lem <dag(at)nimrod(dot)no>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: daitch_mokotoff module
Date: 2021-12-21 21:41:18
Message-ID: yge7dbxhc01.fsf@sid.nimrod.no
Lists: pgsql-hackers

Hello again,

It turns out that there actually exists an(other) implementation of
the Daitch-Mokotoff Soundex System which gets it right: the JOS
Soundex Calculator at https://www.jewishgen.org/jos/jossound.htm.
Other implementations I have been able to find, like the one in Apache
Commons Codec used in e.g. Elasticsearch, are far from correct.

The source code for the JOS Soundex Calculator is not available, as
far as I can tell. However, I have run the complete list of 98412 last
names at
https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/v1/last_names.all.txt
through the calculator, in order to have a good basis for comparison.

This revealed a few shortcomings in my implementation. In particular I
had to go back to the drawing board in order to handle the dual nature
of "J" correctly. "J" can be either a vowel or a consonant in
Daitch-Mokotoff soundex, and this complicates encoding of the
*previous* character.

I have also done a more thorough review and refactoring of the code,
which should hopefully make things quite a bit more understandable to
others.

The changes are summarized as follows:

* Returns NULL for input without any encodable characters.
* Uses the same "unofficial" rules for "UE" as other implementations.
* Correctly considers "J" as either a vowel or a consonant.
* Corrected encoding for e.g. "HANNMANN".
* Code refactoring and comments for readability.
* Better examples in the documentation.

The implementation now agrees with the JOS Soundex Calculator for the
98412 last names mentioned above, with only the following exceptions:

JOS: cedeño        430000 530000
PG:  cedeño        436000 536000
JOS: sadab(khura)  437000
PG:  sadab(khura)  437590
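
A comparison of this kind can be sketched roughly as follows. This is only
an illustration, not the script actually used: the helper names
(parse_codes, mismatches) are hypothetical, the input is assumed to be one
"name code [code ...]" entry per line with no whitespace inside the name,
and the only data shown are the two exceptions listed above.

```python
from typing import Dict, FrozenSet, Iterator, List, Tuple

Codes = FrozenSet[str]


def parse_codes(line: str) -> Tuple[str, Codes]:
    """Split 'name code1 code2 ...' into (name, set of codes).

    Assumption: the name contains no whitespace, so the first token is the
    name and every remaining token is a soundex code.
    """
    name, *codes = line.split()
    return name, frozenset(codes)


def load(lines: List[str]) -> Dict[str, Codes]:
    """Build a name -> codes mapping from an iterable of text lines."""
    return dict(parse_codes(line) for line in lines if line.strip())


def mismatches(
    reference: Dict[str, Codes], candidate: Dict[str, Codes]
) -> Iterator[Tuple[str, Codes, Codes]]:
    """Yield (name, reference codes, candidate codes) wherever they differ.

    Codes are compared as sets, so the order of alternatives is ignored.
    A name missing from the candidate counts as an empty set of codes.
    """
    for name in sorted(reference):
        got = candidate.get(name, frozenset())
        if got != reference[name]:
            yield name, reference[name], got


# The two exceptions reported above, used here as sample input.
jos = load(["cedeño 430000 530000", "sadab(khura) 437000"])
pg = load(["cedeño 436000 536000", "sadab(khura) 437590"])

for name, expected, actual in mismatches(jos, pg):
    print(f"JOS: {name} {' '.join(sorted(expected))}")
    print(f"PG:  {name} {' '.join(sorted(actual))}")
```

Comparing sets rather than strings keeps the check independent of the
order in which alternative codes are emitted.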

I hope this addition to the fuzzystrmatch extension module will prove
to be useful to others as well!

This is my very first code contribution to PostgreSQL, and I would be
grateful for any advice on how to proceed in order to get the patch
accepted.

Best regards

Dag Lem

Attachment Content-Type Size
v4-daitch_mokotoff.patch text/x-patch 49.6 KB
