Re: String Similarity

From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Christopher Kings-Lynne" <chris(dot)kings-lynne(at)calorieking(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: String Similarity
Date: 2006-05-22 11:47:56
Message-ID: 18476.24.91.171.78.1148298476.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> Try contrib/pg_trgm...

Tri-graphs are interesting, and I'll try to reconsider whether they fit or
not, ut I suspect that do not. (You are the second to recommend it)

Anything based on a word parser is probably not appropriate, the example I
first gave is a little misleading in that it is not the whole of the
problem. Consider this:

"pinkfloyd darkside of the moon - money"

Again, we humans see that this string is almost identical to the others.

I have a working system right now, and it strips all non "alnum()" out of
the strings, then searches the strings for runs, and compiles a list of
runs from longest to shortest.

The algorithm is strlen1*strlen2*N where N is the number of runs detected.
As you can see it is merely brute force.

Secondly, there is a subtle difference between comparing a known string
for which you are searching and comparing two arbitrary strings. The
"known" string is assumed to be in some sort of regular form and the
string to be compared must break down into that form based on your
alorithm. When trying to understand the similarity of two arbitrary
strings, you don't always know what similarities are or what the parsing
rules should be.

ThisIsSomethinThatHumansCanReadButSpaceBasedParsersCanNot.
tHISiSaLSOsOMETHING
Yo uca nalmos trea dthi s

can you see the similarity in these two strings?
CanYouSeeTheSimilarityInTheseTwoStrings?

Ideally I would metahone the individual words, strip out all the white
spaces, and find the run lengths that compare in the two strings, and
calculate the similarity based on the number and size of the runs. I do
not currently metaphone (It isn't clear to me that endcases can be handled
correctly) and I'm not sure how to best calculate similarity. I've been
trying best run, total match, and a host of others, but haven't found one
I really like.

>
> Chris
>
> Mark Woodward wrote:
>> I have a side project that needs to "intelligently" know if two strings
>> are contextually similar. Think about how CDDB information is collected
>> and sorted. It isn't perfect, but there should be enough information to
>> be
>> usable.
>>
>> Think about this:
>>
>> "pink floyd - dark side of the moon - money"
>> "dark side of the moon - pink floyd - money"
>> "money - dark side of the moon - pink floyd"
>> etc.
>>
>> To a human, these strings are almost identical. Similarly:
>>
>> "dark floyd of money moon pink side the"
>>
>> Is a puzzle to be solved by 13 year old children before the movie
>> starts.
>>
>> My post has three questions:
>>
>> (1) Does anyone know of an efficient and numerically quantified method
>> of
>> detecting these sorts of things? I currently have a fairly inefficient
>> and
>> numerically bogus solution that may be the only non-impossible solution
>> for the problem.
>>
>> (2) Does any one see a need for this feature in PostgreSQL? If so, what
>> kind of interface would be best accepted as a patch? I am currently
>> returning a match liklihood between 0 and 100;
>>
>> (3) Is there also a desire for a Levenshtein distence function for text
>> and varchars? I experimented with it, and was forced to write the
>> function
>> in item #1.
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 1: if posting/reading through Usenet, please send an appropriate
>> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
>> message can get through to the mailing list cleanly
>
> --
> Christopher Kings-Lynne
>
> Technical Manager
> CalorieKing
> Tel: +618.9389.8777
> Fax: +618.9389.8444
> chris(dot)kings-lynne(at)calorieking(dot)com
> www.calorieking.com
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Martijn van Oosterhout 2006-05-22 12:39:04 Re: Duplicate definition of LOCALEDIR
Previous Message Peter Eisentraut 2006-05-22 10:17:33 Duplicate definition of LOCALEDIR