String Similarity

From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: String Similarity
Date: 2006-05-19 20:00:48
Message-ID: 18405.24.91.171.78.1148068848.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have a side project that needs to "intelligently" know if two strings
are contextually similar. Think about how CDDB information is collected
and sorted. It isn't perfect, but there should be enough information to be
usable.

Think about this:

"pink floyd - dark side of the moon - money"
"dark side of the moon - pink floyd - money"
"money - dark side of the moon - pink floyd"
etc.

To a human, these strings are almost identical. Similarly:

"dark floyd of money moon pink side the"

Is a puzzle to be solved by 13 year old children before the movie starts.

My post has three questions:

(1) Does anyone know of an efficient and numerically quantified method of
detecting these sorts of things? I currently have a fairly inefficient and
numerically bogus solution that may be the only non-impossible solution
for the problem.

(2) Does any one see a need for this feature in PostgreSQL? If so, what
kind of interface would be best accepted as a patch? I am currently
returning a match liklihood between 0 and 100;

(3) Is there also a desire for a Levenshtein distence function for text
and varchars? I experimented with it, and was forced to write the function
in item #1.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mischa Sandberg 2006-05-19 20:03:08 Porting MSSQL to PGSQL (Was: [OT] MySQL is bad, but THIS bad?)
Previous Message Jim C. Nasby 2006-05-19 20:00:19 Re: [OT] MySQL is bad, but THIS bad?