Re: daitch_mokotoff module

From: Dag Lem <dag(at)nimrod(dot)no>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: daitch_mokotoff module
Date: 2023-01-02 20:43:01
Message-ID: ygea630vfai.fsf@sid.nimrod.no
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

> So, yes, I'm proposing that we return those as array elements and that
> @> is used to match them.

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.
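
To illustrate the difference, here is a minimal sketch in Python that
models the two operators with sets. It is not the module's code and it
does not call PostgreSQL; the helper names (overlaps, contains, matches)
are made up for the example, and the codes are the ones quoted further
down in this thread (Vera 790000, Veras 794000, Borja with both).

```python
# Soundex codes as quoted in this thread; sets stand in for text arrays.
codes = {
    "Vera": {"790000"},
    "Veras": {"794000"},
    "Borja": {"790000", "794000"},
}


def overlaps(a, b):
    """Model of the && operator: true if the arrays share any element."""
    return not a.isdisjoint(b)


def contains(a, b):
    """Model of the @> operator: true if every element of b is in a."""
    return a >= b


def matches(name, op):
    """Hypothetical helper: other names whose codes satisfy op(theirs, ours)."""
    ours = codes[name]
    return sorted(
        other for other, theirs in codes.items()
        if other != name and op(theirs, ours)
    )


if __name__ == "__main__":
    print("Borja &&:", matches("Borja", overlaps))
    print("Borja @>:", matches("Borja", contains))
    print("Vera  &&:", matches("Vera", overlaps))
```

With contains, a search for Borja finds neither Vera nor Veras, since
neither name carries both of Borja's codes. With overlaps, both are
found, which is the behaviour one wants when any shared sound counts as
a match.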

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved on the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays. In any
case, I must admit that text arrays seem like more natural building
blocks than space-delimited text here.

>
>> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
>> however if I understand correctly, you want to index names by single
>> sounds, linking all alike sounding names to the same soundex code. I
>> fail to see how that is useful - if you want to find matches for a name,
>> you simply match against all indexed names. If you only consider one
>> sound, you won't find all names that match.
>
> Hmm, I think we're saying the same thing, but from opposite points of
> view. No, I want each name to return multiple codes, but that those
> multiple codes can be treated as a multiple-value array of codes, rather
> than as a single string of space-separated codes.
>
>> In any case, as explained in the documentation, the implementation is
>> intended to be a companion to Full Text Search, thus text is the natural
>> representation for the soundex codes.
>
> Hmm, I don't agree with this point. The numbers are representations of
> the strings, but they don't necessarily have to be strings themselves.
>
>
>> BTW Vera 790000 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes. But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 790000 and 794000. (Any Spanish speaker will readily tell you that
> neither Vera nor Veras are similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borja's codes. So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives to a person called Veras. And, as a Spanish speaker, that
> would make a lot of sense to me.)
>
>
> Now, it's true that I've chosen to use Spanish names for my silly little
> experiment. Maybe this isn't terribly useful as a practical example,
> because this algorithm seems to have been designed for Jewish surnames and
> perhaps not many (or not any) Jews had Spanish surnames. I don't know;
> I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew
> called Josep Álvarez in his book "The Winemaker", so I guess it's not
> impossible). Anyway, I suspect if you repeat the experiment with names
> of other origins, you'll find pretty much the same results apply there,
> and that is the whole reason D-M returns multiple codes and not just
> one.
>
>
> Merry Christmas :-)

--
Dag
