Re: daitch_mokotoff module

From: Dag Lem <dag(at)nimrod(dot)no>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: daitch_mokotoff module
Date: 2022-12-23 22:48:02
Message-ID: ygev8m1visd.fsf@sid.nimrod.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:

> On 2022-Dec-23, Alvaro Herrera wrote:
>

[...]

> I tried downloading a list of surnames from here
> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
> pasted that in a text file and \copy'ed it into a table. Then I ran
> this query
>
> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
> from apellidos
> group by daitch_mokotoff(a)
> order by count(*) desc;
>
> so I have a first entry like this
>
> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
> daitch_mokotoff │ 784000
> count │ 18
>
> but then I have a bunch of other entries with the same code 784000 as
> alternative codes,
>
> string_agg │ Velazco
> daitch_mokotoff │ 784500 784000
> count │ 1
>
> string_agg │ Palacio
> daitch_mokotoff │ 785000 784000
> count │ 1
>
> I suppose I need to group these together somehow, and it would make more
> sense to do that if the values were arrays.
>
>
> If I scroll a bit further down and choose, say, 794000 (a relatively
> popular one), then I have this
>
> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
> Peraza Peres Perez Porras Varas Veras
> daitch_mokotoff │ 794000
> count │ 14
>
> and looking for that code in the result I also get these three
>
> string_agg │ Barca Barco Parco
> daitch_mokotoff │ 795000 794000
> count │ 3
>
> string_agg │ Borja
> daitch_mokotoff │ 790000 794000
> count │ 1
>
> string_agg │ Borjas
> daitch_mokotoff │ 794000 794400
> count │ 1
>
> and then I see that I should also search for possible matches in codes
> 795000, 790000 and 794400, so that gives me
>
> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
> Perea Vera
> daitch_mokotoff │ 790000
> count │ 11
>
> string_agg │ Barriga Borge Borrego Burgo Fraga
> daitch_mokotoff │ 795000
> count │ 5
>
> string_agg │ Borjas
> daitch_mokotoff │ 794000 794400
> count │ 1
>
> which look closely related (compare "Veras" in the first to "Vera" in
> the later set. If you ignore that pseudo-match, you're likely to miss
> possible family relationships.)
>
>
> I suppose if I were a genealogy researcher, I would be helped by having
> each of these codes behave as a separate unit, rather than me having to
> split the string into the several possible contained values.

It seems to me like you're trying to use soundex coding for something it
was never designed for.

As stated in my previous mail, soundex algorithms are designed to index
names on some representation of sound, so that alike sounding names with
different spellings will match, and as shown in the documentation
example, that is exactly what the implementation facilitates.

Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
however if I understand correctly, you want to index names by single
sounds, linking all alike sounding names to the same soundex code. I
fail to see how that is useful - if you want to find matches for a name,
you simply match against all indexed names. If you only consider one
sound, you won't find all names that match.

In any case, as explained in the documentation, the implementation is
intended to be a companion to Full Text Search, thus text is the natural
representation for the soundex codes.

BTW Vera 790000 does not match Veras 794000, because they don't sound
the same (up to the maximum soundex code length).

Best regards

Dag Lem

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joseph Koshakow 2022-12-23 23:03:36 Re: Infinite Interval
Previous Message Robert Haas 2022-12-23 21:55:38 Re: fixing CREATEROLE