Re: daitch_mokotoff module

From: Andres Freund <andres(at)anarazel(dot)de>
To: Dag Lem <dag(at)nimrod(dot)no>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>
Subject: Re: daitch_mokotoff module
Date: 2022-12-07 18:56:56
Message-ID: 20221207185656.nlaj2o5kcbgbadcp@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-03 15:27:32 +0100, Dag Lem wrote:
> Just some minor adjustments to the patch:
>
> * Removed call to locale-dependent toupper()
> * Cleaned up input normalization

This patch currently fails in cfbot, likely because meson.build needs to be
adjusted (this didn't exist at the time you submitted this version of the
patch):

[23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File fuzzystrmatch--1.1.sql does not exist.

> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
> PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
that. In recent years our approach has been to just keep the "base version" of
the upgrade script, with extension creation running through the upgrade
scripts.

>
> +
> +#include "daitch_mokotoff.h"
> +
> +#include "postgres.h"

Postgres policy is that the include of "postgres.h" has to be the first
include in every .c file.

> +#include "utils/builtins.h"
> +#include "mb/pg_wchar.h"
> +
> +#include <string.h>
> +
> +/* Internal C implementation */
> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
> +
> +
> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
> +Datum
> +daitch_mokotoff(PG_FUNCTION_ARGS)
> +{
> + text *arg = PG_GETARG_TEXT_PP(0);
> + char *string,
> + *tmp_soundex;
> + text *soundex;
> +
> + /*
> + * The maximum theoretical soundex size is several KB, however in practice
> + * anything but contrived synthetic inputs will yield a soundex size of
> + * less than 100 bytes. We thus allocate and free a temporary work buffer,
> + * and return only the actual soundex result.
> + */
> + string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
> + tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);

Seems that just using StringInfo to hold the soundex output would work better
than a static allocation?

> + if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))

We imo shouldn't introduce new functions starting with _.

> +/* Mark soundex code tree node as leaf. */
> +static void
> +set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
> +{
> + if (!node->is_leaf)
> + {
> + node->is_leaf = 1;
> + leaves_next[(*num_leaves_next)++] = node;
> + }
> +}
> +
> +
> +/* Find next node corresponding to code digit, or create a new node. */
> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
> + dm_node * node, char code_digit)

PG code style is to have a line break between a function defintion's return
type and the function name - like you actually do above.

> +/* Mapping from ISO8859-1 to upper-case ASCII */
> +static const char tr_iso8859_1_to_ascii_upper[] =
> +/*
> +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
> +*/
> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ ! ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
> +
> +static char
> +iso8859_1_to_ascii_upper(unsigned char c)
> +{
> + return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
> +}
> +
> +
> +/* Convert an UTF-8 character to ISO-8859-1.
> + * Unconvertable characters are returned as '?'.
> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
> + */
> +static char
> +utf8_to_iso8859_1(char *str, int *ix)

It seems decidedly not great to have custom encoding conversion routines in a
contrib module. Is there any way we can avoid this?

> +/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
> +static char *
> +_daitch_mokotoff(char *word, char *soundex, size_t n)
> +{
> + int i = 0,
> + j;
> + int letter_no = 0;
> + int ix_leaves = 0;
> + int num_nodes = 0,
> + num_leaves = 0;
> + dm_codes *codes,
> + *next_codes;
> + dm_node *nodes;
> + dm_leaves *leaves;
> +
> + /* First letter. */
> + if (!(codes = read_letter(word, &i)))
> + {
> + /* No encodable character in input. */
> + return NULL;
> + }
> +
> + /* Allocate memory for node tree. */
> + nodes = palloc(sizeof(dm_nodes));
> + leaves = palloc(2 * sizeof(dm_leaves));

So this allocates the worst case memory usage, is that right? That's quite a
bit of memory. Shouldn't nodes be allocated dynamically?

Instead of carefully freeing individual memory allocations, I think it be
better to create a temporary memory context, allocate the necessary nodes etc
on demand, and destroy the temporary memory context at the end.

> +/* Codes for letter sequence at start of name, before a vowel, and any other. */
> +static dm_codes codes_0_1_X[2] =

Any reason these aren't all const?

It's not clear to me where the intended line between the .h and .c file is.

> +print <<EOF;
> +/*
> + * Types and lookup tables for Daitch-Mokotoff Soundex
> + *

If we generate the code, why is the generated header included in the commit?

> +/* Letter in input sequence */
> +struct dm_letter
> +{
> + char letter; /* Present letter in sequence */
> + struct dm_letter *letters; /* List of possible successive letters */
> + dm_codes *codes; /* Code sequence(s) for complete sequence */
> +};
> +
> +/* Node in soundex code tree */
> +struct dm_node
> +{
> + int soundex_length; /* Length of generated soundex code */
> + char soundex[DM_MAX_CODE_DIGITS + 1]; /* Soundex code */
> + int is_leaf; /* Candidate for complete soundex code */
> + int last_update; /* Letter number for last update of node */
> + char code_digit; /* Last code digit, 0 - 9 */
> +
> + /*
> + * One or two alternate code digits leading to this node. If there are two
> + * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
> + * back to the same node.
> + */
> + char prev_code_digits[2];
> + /* One or two alternate code digits moving forward. */
> + char next_code_digits[2];
> + /* ORed together code index(es) used to reach current node. */
> + int prev_code_index;
> + int next_code_index;
> + /* Nodes branching out from this node. */
> + struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
> +};
> +
> +typedef struct dm_letter dm_letter;
> +typedef struct dm_node dm_node;

Why is all this in the generated header? It needs DM_MAX_ALTERNATE_CODES etc,
but it seems that the structs could just be defined in the .c file.

> +# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html

What does "adapted" mean here? And what's the path to updating the data?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-12-07 18:59:43 Re: Partial aggregates pushdown
Previous Message Andres Freund 2022-12-07 18:23:37 Re: Transaction timeout