Re: Enhancing phonetic search support for more languages - GSoC 2010

From: Dhiraj Lohiya <lohiya(dot)dhiraj(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Dave Page <dpage(at)pgadmin(dot)org>
Subject: Re: Enhancing phonetic search support for more languages - GSoC 2010
Date: 2010-04-08 03:35:11
Message-ID: h2ib268c9e91004072035g2eae0879sbaa147605e68478@mail.gmail.com
Lists: pgsql-hackers

> I'm also curious why you chose to focus on the extremely imprecise
> soundex instead of the more discriminating metaphone.
>
>
The main reason for choosing soundex over metaphone/double metaphone is
that, for Indian languages, soundex with a few customizations works pretty
well. Double Metaphone only adds processing overhead, along with the need to
store two hashes, but the matching performance would remain the same. Words
in Indian languages are pronounced according to the phonology of the
Devanagari script, which has no silent letters or other accent-related
variations (handling those was a major consideration in the design of Double
Metaphone). One more customization required for Indian languages is that the
characters of a word aren't taken one by one; instead, the word is broken
into substrings of consecutive vowels and consecutive consonants, and each
substring is mapped to its equivalence class. Also, one rule from metaphone
needs to be incorporated: soundex does not encode the first letter of the
word, but we would encode it as well, using its equivalence class.
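
To make the idea concrete, here is a rough sketch of the run-based encoding
in Python. It works on romanized (Latin-script) input only to keep the
example short, and every table and function name in it (VOWEL_RUN_CLASS,
CONSONANT_RUN_CLASS, LETTER_CLASS, phonetic_key) is made up for
illustration; the real equivalence classes would have to be derived from the
phonology of each language.

```python
from itertools import groupby

VOWELS = set("aeiou")

# Hypothetical equivalence classes, for illustration only.
VOWEL_RUN_CLASS = {
    "a": "A", "aa": "A", "i": "I", "ee": "I", "ii": "I",
    "u": "U", "oo": "U", "uu": "U", "e": "E", "ai": "E",
    "o": "O", "au": "O",
}
CONSONANT_RUN_CLASS = {
    "kh": "K", "gh": "K", "ch": "C", "chh": "C", "jh": "C",
    "th": "T", "dh": "T", "ph": "P", "bh": "P", "sh": "S",
}
LETTER_CLASS = {
    "k": "K", "g": "K", "q": "K", "c": "C", "j": "C", "z": "C",
    "t": "T", "d": "T", "p": "P", "b": "P", "f": "P",
    "v": "V", "w": "V", "s": "S", "m": "M", "n": "M",
    "l": "L", "r": "R", "y": "Y", "h": "H",
}


def encode_run(run, is_vowel):
    """Map one run of vowels or consonants to its class symbol(s)."""
    table = VOWEL_RUN_CLASS if is_vowel else CONSONANT_RUN_CLASS
    if run in table:
        return table[run]
    if is_vowel:
        # Unknown vowel run: fall back to the class of its first letter.
        return VOWEL_RUN_CLASS.get(run[0], "")
    # Unknown consonant run: encode letter by letter, merging repeats.
    out = []
    for ch in run:
        code = LETTER_CLASS.get(ch, "")
        if code and (not out or out[-1] != code):
            out.append(code)
    return "".join(out)


def phonetic_key(word):
    """Encode every run, including the one holding the first letter."""
    letters = "".join(ch for ch in word.lower() if ch.isalpha())
    runs = groupby(letters, key=lambda ch: ch in VOWELS)
    return "".join(encode_run("".join(run), is_vowel) for is_vowel, run in runs)
```

With these made-up tables, phonetic_key("Dhiraj") and phonetic_key("Deeraj")
both give "TIRAC". Note that the first letter is replaced by its class
rather than kept literally as in classic soundex.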

With this approach, Soundex (ignoring silent letters, and breaking the word
into substrings rather than processing it character by character) delivers
almost the same performance with much less overhead than Double Metaphone,
whose handling of silent letters, accents etc. doesn't have much impact on
Indian languages. Hence this would be more efficient.

For western languages, double metaphone is known to give very good results.
Hence, it could be used there.

My previous mail concentrated on soundex because I had also considered how
it could self-improve its rule set of equivalence classes. That is a little
trickier in double metaphone, whereas in soundex we can derive new rules
from the mappings that are already present. But whether we want to include
this functionality as well can be looked at later.

So for the SoC project, as proposed, I could probably concentrate on the
algorithmic part for multi-lingual support. Once the framework is ready,
with tutorials and a wiki on how to add rules for a new language, the
community could contribute rules for more languages, and after testing
against a particular quality threshold these could be incorporated.
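
As a rough idea of what "adding rules for a new language" could look like,
here is a small Python sketch in which each language is just a table from
substrings to equivalence classes. The names (RULES, register_language,
encode) and the "hi-Latn" table are hypothetical, not an existing API, and
the greedy longest-match lookup is only one possible design.

```python
from typing import Dict

# language code -> {substring: equivalence class}; hypothetical registry
RULES: Dict[str, Dict[str, str]] = {}


def register_language(code: str, table: Dict[str, str]) -> None:
    """Add or replace the rule table for one language."""
    if not table:
        raise ValueError("rule table must not be empty")
    RULES[code] = {key.lower(): cls for key, cls in table.items()}


def encode(word: str, code: str) -> str:
    """Greedy longest-match encoding with the table registered for `code`."""
    table = RULES[code]
    longest = max(len(key) for key in table)
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            cls = table.get(word[i:i + size])
            if cls is not None:
                out.append(cls)
                i += size
                break
        else:
            i += 1  # no rule covers this character: skip it
    return "".join(out)
```

A contributor would then only supply a table, for example
register_language("hi-Latn", {"dh": "T", "d": "T", "ee": "I", "i": "I",
"r": "R", "a": "A", "j": "C"}), after which encode("Dhiraj", "hi-Latn") and
encode("Deeraj", "hi-Latn") produce the same key.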

Thanks for the inputs. More suggestions/reviews please!

--
Regards
Dhiraj Lohiya
