Re: daitch_mokotoff module

From: Dag Lem <dag(at)nimrod(dot)no>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>
Subject: Re: daitch_mokotoff module
Date: 2022-12-21 09:26:05
Message-ID: ygeh6xp2jma.fsf@sid.nimrod.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andreas,

Thank you for your detailed and constructive review!

I have made a conscientuous effort to address all the issues you point
out, please see comments below.

Andres Freund <andres(at)anarazel(dot)de> writes:

> Hi,
>
> On 2022-02-03 15:27:32 +0100, Dag Lem wrote:

[...]

> [23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File
> fuzzystrmatch--1.1.sql does not exist.
>
>
>> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
>> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql
>> fuzzystrmatch--1.0--1.1.sql
>> PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
>
>
> The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
> that. In recent years our approach has been to just keep the "base version" of
> the upgrade script, with extension creation running through the upgrade
> scripts.
>

OK, I have now kept fuzzystrmatch--1.1.sql, and omitted
fuzzystrmatch--1.2.sql

Both the Makefile and meson.build are updated to handle the new files,
including the generated header.

>>
>> +
>> +#include "daitch_mokotoff.h"
>> +
>> +#include "postgres.h"
>
> Postgres policy is that the include of "postgres.h" has to be the first
> include in every .c file.
>
>

OK, fixed.

>> +#include "utils/builtins.h"
>> +#include "mb/pg_wchar.h"
>> +
>> +#include <string.h>
>> +
>> +/* Internal C implementation */
>> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
>> +
>> +
>> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
>> +Datum
>> +daitch_mokotoff(PG_FUNCTION_ARGS)
>> +{
>> + text *arg = PG_GETARG_TEXT_PP(0);
>> + char *string,
>> + *tmp_soundex;
>> + text *soundex;
>> +
>> + /*
>> + * The maximum theoretical soundex size is several KB, however in
>> practice
>> + * anything but contrived synthetic inputs will yield a soundex size of
>> + * less than 100 bytes. We thus allocate and free a temporary work
>> buffer,
>> + * and return only the actual soundex result.
>> + */
>> + string = pg_server_to_any(text_to_cstring(arg),
>> VARSIZE_ANY_EXHDR(arg), PG_UTF8);
>> + tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
>
> Seems that just using StringInfo to hold the soundex output would work better
> than a static allocation?
>

OK, fixed.

>
>> + if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
>
> We imo shouldn't introduce new functions starting with _.
>

OK, fixed. Note that I just followed the existing pattern in
fuzzystrmatch.c there.

[...]

>> +/* Find next node corresponding to code digit, or create a new node. */
>> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
>> + dm_node * node, char code_digit)
>
> PG code style is to have a line break between a function defintion's return
> type and the function name - like you actually do above.
>

OK, fixed. Both pgindent and I must have missed that particular
function.

>> +/* Mapping from ISO8859-1 to upper-case ASCII */
>> +static const char tr_iso8859_1_to_ascii_upper[] =
>> +/*
>> +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬
>> ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
>> +*/
>> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ !
>> ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
>> +
>> +static char
>> +iso8859_1_to_ascii_upper(unsigned char c)
>> +{
>> + return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
>> +}
>> +
>> +
>> +/* Convert an UTF-8 character to ISO-8859-1.
>> + * Unconvertable characters are returned as '?'.
>> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
>> + */
>> +static char
>> +utf8_to_iso8859_1(char *str, int *ix)
>
> It seems decidedly not great to have custom encoding conversion routines in a
> contrib module. Is there any way we can avoid this?
>

I have now replaced the custom UTF-8 decode with calls to
utf8_to_unicode and pg_utf_mblen, and simplified the subsequent
conversion to ASCII. Hopefully this makes the conversion code more
palatable.

I don't see how the conversion to ASCII could be substantially
simplified further. The conversion maps lowercase and 8 bit ISO8859-1
characters to ASCII via uppercasing, removal of accents, and discarding
of special characters. In addition to that, it maps (the non-ISO8859-1)
Ą, Ę, and Ţ/Ț from the coding chart to [, \, and ]. After this, a simple
O(1) table lookup can be used to retrieve the soundex code tree for a
letter sequence.

>
>> +/* Generate all Daitch-Mokotoff soundex codes for word, separated
>> by space. */
>> +static char *
>> +_daitch_mokotoff(char *word, char *soundex, size_t n)
>> +{
>> + int i = 0,
>> + j;
>> + int letter_no = 0;
>> + int ix_leaves = 0;
>> + int num_nodes = 0,
>> + num_leaves = 0;
>> + dm_codes *codes,
>> + *next_codes;
>> + dm_node *nodes;
>> + dm_leaves *leaves;
>> +
>> + /* First letter. */
>> + if (!(codes = read_letter(word, &i)))
>> + {
>> + /* No encodable character in input. */
>> + return NULL;
>> + }
>> +
>> + /* Allocate memory for node tree. */
>> + nodes = palloc(sizeof(dm_nodes));
>> + leaves = palloc(2 * sizeof(dm_leaves));
>
> So this allocates the worst case memory usage, is that right? That's quite a
> bit of memory. Shouldn't nodes be allocated dynamically?
>
> Instead of carefully freeing individual memory allocations, I think it be
> better to create a temporary memory context, allocate the necessary nodes etc
> on demand, and destroy the temporary memory context at the end.
>

Yes, the one-time allocation was intended to cover the worst case memory
usage. This was done to avoid any performance hit incurred by allocating
and deallocating memory for each new node in the soundex code tree.

I have rewritten the bookeeping of nodes in the soundex code tree to use
linked lists, and have followed your advice to use a temporary memory
context for allocation.

I also made an optimization by excluding completed soundex nodes from
the next letter iteration. This seems to offset any allocation overhead
- the performance is more or less the same as before.

>
>> +/* Codes for letter sequence at start of name, before a vowel, and
>> any other. */
>> +static dm_codes codes_0_1_X[2] =
>
> Any reason these aren't all const?
>

No reason why they can't be :-) They are now changed to const.

>
> It's not clear to me where the intended line between the .h and .c file is.
>
>
>> +print <<EOF;
>> +/*
>> + * Types and lookup tables for Daitch-Mokotoff Soundex
>> + *
>
> If we generate the code, why is the generated header included in the commit?
>

This was mainly to have the content available for reference without
having to generate the header. I have removed the file - after the
change you suggest below, the struct declarations are available in the
.c file anyway.

>> +/* Letter in input sequence */
>> +struct dm_letter
>> +{
>> + char letter; /* Present letter in sequence */
>> + struct dm_letter *letters; /* List of possible successive letters
>> */
>> + dm_codes *codes; /* Code sequence(s) for complete sequence */
>> +};
>> +
>> +/* Node in soundex code tree */
>> +struct dm_node
>> +{
>> + int soundex_length; /* Length of generated soundex code */
>> + char soundex[DM_MAX_CODE_DIGITS + 1]; /* Soundex code */
>> + int is_leaf; /* Candidate for complete soundex code */
>> + int last_update; /* Letter number for last update of node */
>> + char code_digit; /* Last code digit, 0 - 9 */
>> +
>> + /*
>> + * One or two alternate code digits leading to this node. If there
>> are two
>> + * digits, one of them is always an 'X'. Repeated code digits and
>> X' lead
>> + * back to the same node.
>> + */
>> + char prev_code_digits[2];
>> + /* One or two alternate code digits moving forward. */
>> + char next_code_digits[2];
>> + /* ORed together code index(es) used to reach current node. */
>> + int prev_code_index;
>> + int next_code_index;
>> + /* Nodes branching out from this node. */
>> + struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
>> +};
>> +
>> +typedef struct dm_letter dm_letter;
>> +typedef struct dm_node dm_node;
>
> Why is all this in the generated header? It needs DM_MAX_ALTERNATE_CODES etc,
> but it seems that the structs could just be defined in the .c file.
>

To accomplish this, I had to rearrange the code a bit. The structs are
now all declared in daitch_mokotoff.c, and the generated header is
included inbetween them.

>
>> +# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
>
> What does "adapted" mean here? And what's the path to updating the data?
>

It means that the original soundex coding chart, which is referred to,
has been converted to a machine readable format, with a few
modifications. These modifications are outlined further down in the
comments. I expanded a bit on the comments, hopefully making things
clearer.

I don't think there is much to be said about updating the data - that's
simply a question of modifying the table and regenerating the header
file. It goes without saying that making changes requires an
understanding of the soundex coding, which is explained in the
reference. However if anything should be unclear, please do point out
what should be explained better.

> Greetings,
>
> Andres Freund
>

Thanks again, and a Merry Christmas to you and all the other PostgreSQL
hackers!

Best regards,

Dag Lem

Attachment Content-Type Size
v7-daitch_mokotoff.patch text/x-patch 38.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2022-12-21 09:50:12 RE: Force streaming every change in logical decoding
Previous Message Drouvot, Bertrand 2022-12-21 09:06:42 Re: Minimal logical decoding on standbys