Re: String Similarity

From:	"Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To:	"Greg Sabino Mullane" <greg(at)turnstep(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: String Similarity
Date:	2006-05-20 01:00:51
Message-ID:	18465.24.91.171.78.1148086851.squirrel@mail.mohawksoft.com
Lists:	pgsql-hackers

>> I have a side project that needs to "intelligently" know if two strings
>> are contextually similar.
>
> The examples you gave seem heavy on word order and whitespace
> consideration, before applying any algorithms. Here's a quick perl
> version that does the job:

[SNIP]

This is a case where the example was too simple to explain the problem,
sorry. I have an implementation of Oracle's "contains" function for
PostgreSQL, and it does basically what you are doing; in fact, it also
has Mohawk Software Extensions (LOL) that provide metaphone. The problem
is that parsing on white space really isn't reliable. Sometimes the input
is pinkfloyd-darksideofthemoon, with no spaces at all.

Also, I have been thinking of other applications.

I have a piece of code that does this:

apps$ ./stratest "pink foyd dark side of the moon money" \
    "money dark side of the moon pink floyd"
Match: dark side of the moon
Match: pink f
Match: money
Match: oyd

apps$ ./stratest "pinkfoyddarksideofthemoonmoney" \
    "moneydarksideofthemoonpinkfloyd"
Match: darksideofthemoon
Match: pinkf
Match: money
Match: oyd

I need to come up with a numerically sane way of taking this information
and understanding overall "similarity."

In response to

Re: String Similarity at 2006-05-19 22:50:00 from Greg Sabino Mullane
