Re: daitch_mokotoff module

From: Dag Lem <dag(at)nimrod(dot)no>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: daitch_mokotoff module
Date: 2021-12-14 22:34:00
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:


> Thanks, looks interesting. A couple generic comments, based on a quick
> code review.

Thank you for the constructive review!

> 1) Can the extension be marked as trusted, just like fuzzystrmatch?

I have now moved the daitch_mokotoff function into the fuzzystrmatch
module, as suggested by Andrew Dunstan.

> 2) The docs really need an explanation of what the extension is for,
> not just a link to fuzzystrmatch. Also, a couple examples would be
> helpful, I guess - similarly to fuzzystrmatch. The last line in the
> docs is annoyingly long.

Please see the updated documentation for the fuzzystrmatch module.

> 3) What's for? I mean, it generates the
> header, but when do we need to run it?

It only has to be run if the soundex rules are changed. I have now
made the dependencies explicit in the fuzzystrmatch Makefile.

> 4) It seems to require perl-open, which is a module we did not need
> until now. Not sure how well supported it is, but maybe we can use a
> more standard module?

I believe Perl I/O layers have been part of Perl core for two decades
now :-)

> 5) Do we need to keep DM_MAIN? It seems to be meant for some kind of
> testing, but our regression tests certainly don't need it (or the
> palloc mockup). I suggest to get rid of it.

Done. BTW this was modeled after dmetaphone.c

> 6) I really don't understand some of the comments in
> daitch_mokotov.sql, like for example:
> -- testEncodeBasic
> -- Tests covered above are omitted.
> Also, comments with names of Java methods seem pretty confusing. It'd
> be better to actually explain what rules are the tests checking.

The tests were copied from various web sites and implementations. I have
cut down on the number of tests and made the comments more to the point.

> 7) There are almost no comments in the .c file (ignoring the comment
> on top). Short functions like initialize_node are probably fine
> without one, but e.g. update_node would deserve one.

More comments are added to both the .h and the .c file.

> 8) Some of the lines are pretty long (e.g. the update_node signature
> is almost 170 chars). That should be wrapped. Maybe try running
> pgindent on the code, that'll show which parts need better formatting
> (so as not to get broken later).

Fixed. I did run pgindent earlier, however it didn't catch those long

> 9) I'm sure there's better way to get the number of valid chars than this:
> for (i = 0, ilen = 0; (c = read_char(&str[i], &ilen)) && (c < 'A' ||
> c > ']'); i += ilen)
> {
> }
> Say, a while loop or something?

The code gets to the next encodable character, skipping any other
characters. I have now added a comment which should hopefully make this
clearer, and broken up the for loop for readability.

Please find attached the revised patch.

Best regards

Dag Lem

Attachment Content-Type Size
v3-daitch_mokotoff.patch text/x-patch 46.0 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-12-14 23:00:34 generalized conveyor belt storage
Previous Message Tom Lane 2021-12-14 22:30:11 Re: Remove pg_strtouint64(), use strtoull() directly