Enhancing phonetic search support for more languages - GSoC 2010

From: Dhiraj Lohiya <lohiya(dot)dhiraj(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Dave Page <dpage(at)pgadmin(dot)org>
Subject: Enhancing phonetic search support for more languages - GSoC 2010
Date: 2010-04-07 20:24:53
Message-ID: h2rb268c9e91004071324r2ea2471p3135f5d4b485ad30@mail.gmail.com
Lists: pgsql-hackers

Hello

I am Dhiraj Lohiya, a Computer Science undergraduate from BITS Pilani. I
would like to propose an idea to improve the *phonetic search support*,
initially for some Indian languages like Hindi and Marathi, with a framework
for extending it to other languages easily by contributing rules in a simple
format. I am looking to take it forward as a *GSoC project*. Please see
whether you find this interesting enough:

I plan to customize the Soundex algorithm so that each language can have its
own set of phonetic equivalence classes (generally around 20 rules for most
Indian languages I have worked with). I would keep the approach layered so
that rule sets for multiple languages can be contributed easily and more
languages can be added by others.
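
As a rough illustration of the layered idea, here is a minimal Python sketch
in which the keying routine is shared and each language only supplies a table
of equivalence classes. The names `classes`, `RULES` and `phonetic_key`, the
"hi-latin" label, and the Hindi letter groupings are all assumptions made for
this example, not the actual rules; the "en" table follows the usual Soundex
letter classes, and the routine is a simplified Soundex-style pass (it does
not treat "h" and "w" specially).

```python
def classes(*groups):
    """Build a letter -> code table from groups of similar-sounding letters."""
    return {ch: str(i) for i, group in enumerate(groups, start=1) for ch in group}

# One table per language. Only these tables change between languages.
RULES = {
    # Standard Soundex consonant classes for English.
    "en": classes("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"),
    # Hypothetical classes for Hindi written in Latin script (illustrative only).
    "hi-latin": classes("kgq", "cjszx", "td", "pbf", "vw", "mn", "r", "l", "y"),
}

def phonetic_key(word, lang, length=4):
    """Return a fixed-length key: first letter plus class codes of later letters."""
    rules = RULES[lang]
    word = word.lower()
    if not word:
        return ""
    key = word[0].upper()
    prev = rules.get(word[0], "")
    for ch in word[1:]:
        code = rules.get(ch, "")      # letters without a class (vowels) give ""
        if code and code != prev:     # skip repeats of the same class
            key += code
        prev = code
    return key.ljust(length, "0")[:length]

print(phonetic_key("Robert", "en"))   # R163
print(phonetic_key("naiyya", "hi-latin") == phonetic_key("nayya", "hi-latin"))  # True
```

Adding a language in this sketch means adding one more entry to `RULES`; the
keying code itself stays untouched.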

Moreover, once a base set of rules has been defined by someone, it is
important that the rules can themselves be extended and evolve based on user
input and usage.
For instance, if many users (above a threshold set by us) enter some search
string for which no wanted search result is retrieved, we could track what
each of them finally selects and then append to or modify our set of phonetic
rules based on the phonetic mismatch between the query entered and the result
wanted. In this way, *the rule set could evolve by itself as we collect usage
statistics from users based on their experience.* This feature would add a
new dimension to the search functionality and would surely stand out.
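
One possible shape for this feedback loop, sketched in Python under strong
simplifying assumptions: only queries that differ from the selected result in
exactly one letter are considered, and a rule is changed once the same letter
pair has been seen `threshold` times. `RuleLearner`, `single_letter_diff` and
the sample rule table are hypothetical names and data for illustration only.

```python
from collections import Counter

def single_letter_diff(a, b):
    """Return (typed, wanted) if the words differ in exactly one position."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None

class RuleLearner:
    def __init__(self, rules, threshold=3):
        self.rules = dict(rules)      # letter -> equivalence-class code
        self.threshold = threshold    # how many users must agree
        self.votes = Counter()

    def record(self, query, selected):
        """Note that a user typed `query` but finally picked `selected`.

        Returns True when this observation caused a rule change.
        """
        pair = single_letter_diff(query.lower(), selected.lower())
        if pair is None or self.rules.get(pair[0]) == self.rules.get(pair[1]):
            return False              # nothing to learn, or already equivalent
        self.votes[pair] += 1
        if self.votes[pair] < self.threshold:
            return False
        typed, wanted = pair
        if wanted in self.rules:
            self.rules[typed] = self.rules[wanted]   # merge into wanted's class
        else:
            self.rules.pop(typed, None)              # treat both as unclassified
        return True

learner = RuleLearner({"v": "1", "w": "2"}, threshold=3)
for _ in range(3):
    changed = learner.record("vada", "wada")
print(changed, learner.rules)
```

A real implementation would need a better alignment than a one-letter diff and
some review step before a learned rule goes live; this only shows the
threshold idea.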

Initially I plan to code this for a few Indian languages such as Hindi and
Marathi, and to define a simple way (probably a GUI based on the concept of
Google Image Labeler <http://images.google.com/imagelabeler/>, wherein two
words which sound similar are mapped to improve the rule set) in which rules
for different languages can be added directly, so that people who know those
languages can contribute.

*Samples:*

- Some cases from Hindi songs:
  - If I search for a song which has the word "naiyya" but I spell the word
    as "nayya", presently no result is returned, since that spelling is not
    in the playlist.
  - Moreover, if "pyar" is searched, the results differ from those for
    "pyaar", but it is easy to see that both are the same word and hence
    should give the same results.
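
To make the samples concrete, here is a small Python sketch of a playlist
lookup that indexes titles by a phonetic key instead of by exact spelling. The
key used here (`rough_key`: keep the first letter, drop later vowels, collapse
doubled letters) is only a stand-in so that the example is self-contained; it
is not the Soundex variant proposed above. The playlist titles are made up.

```python
from collections import defaultdict

VOWELS = set("aeiou")

def rough_key(word):
    """Stand-in phonetic key: first letter, then consonants with doubles collapsed."""
    word = word.lower()
    key = word[:1]
    for ch in word[1:]:
        if ch in VOWELS or ch == key[-1]:
            continue
        key += ch
    return key

def build_index(titles):
    """Map the key of every word in every title to the titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            index[rough_key(word)].add(title)
    return index

def search(index, query):
    """Return the titles whose words share a key with the query word."""
    return sorted(index.get(rough_key(query), ()))

playlist = ["O Naiyya", "Pehla Pyaar"]   # made-up titles
index = build_index(playlist)
print(search(index, "nayya"))
print(search(index, "pyar"))
```

With exact matching both queries would come back empty; with a key-based
index, each misspelling lands on the same entry as the stored spelling.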

*Some background on this:*
I have already worked out a basic customized version of the Soundex algorithm
as part of my intern project at PennyWiseSolutions
<http://www.pennywisesolutions.com/> and implemented it in Java (it could
also improve its own rule set from a pair of phonetically similar input
words). Right now, the rule sets are designed only for Hindi and Marathi. The
results are narrowed down pretty well, with far fewer false positives, and
this works well with Marathi and Hindi. Since the algorithm itself remains the
same (almost equivalent to Soundex) and only the rule sets for other
languages need to be contributed for the algorithm to process, I think this
approach could work. One specific customization was to drop Soundex's
handling of silent letters, since users don't really use silent letters when
spelling a Hindi word in English.

I would be glad to have more input on this.

--
Regards
Dhiraj Lohiya
