From: | "Greg Sabino Mullane" <greg(at)turnstep(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: String Similarity |
Date: | 2006-05-19 22:50:00 |
Message-ID: | 6592ec8ffe8907400bc98e9efa60c62c@biglumber.com |
Lists: | pgsql-hackers |
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> I have a side project that needs to "intelligently" know if two strings
> are contextually similar.
The examples you gave seem to depend mostly on handling word order and
whitespace before applying any algorithm. Here's a quick Perl version that
does the job:
CREATE OR REPLACE FUNCTION matchval(text,text)
RETURNS INT LANGUAGE plperlu AS
$$
use strict;
use String::Approx 'adist';
# Normalize each argument: lowercase, split on whitespace, sort the words
my $uno = join ' ', sort split /\s+/ => lc shift;
my $dos = join ' ', sort split /\s+/ => lc shift;
# Pass the shorter string to adist first, as the pattern
return adist(length $uno < length $dos ? ($uno,$dos) : ($dos,$uno));
$$;
Some sample runs:
SELECT matchval('pink floyd - dark side of the moon - money', 'dark side of the moon - pink floyd - money');
SELECT matchval('dark floyd of money moon pink side the', 'Money - dark side of the moon - Pink Floyd');
SELECT matchval('dark floyd of money moon pink side the', 'monee - drk sidez of da moon - pink floyd');
SELECT matchval('dark floyd of money moon pink side the', 'pink floyd - animals');
SELECT matchval('dark floyd of money moon pink side the', 'walking on the moon - the police');
The above returns 0, 0, 6, 10, and 17; a score of 0 is an exact match.
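To see why the first sample scores 0, it can help to look at the normalization step on its own. The following is a minimal standalone sketch using only core Perl; the helper name `normalize` is introduced here for illustration and is not part of the function above, and the sketch deliberately leaves out the String::Approx call.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original post): applies the same
# normalization matchval uses on each argument before calling adist,
# i.e. lowercase, split on whitespace, sort the words, rejoin.
sub normalize {
    my ($text) = @_;
    return join ' ', sort split /\s+/, lc $text;
}

my $first  = normalize('pink floyd - dark side of the moon - money');
my $second = normalize('dark side of the moon - pink floyd - money');

print "$first\n$second\n";
print $first eq $second
    ? "identical after normalization\n"
    : "different after normalization\n";
```

Both inputs reduce to the same sorted word list, so the distance computed afterwards has nothing left to measure for that pair.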
- --
Greg Sabino Mullane greg(at)turnstep(dot)com
PGP Key: 0x14964AC8 200605191835
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----
iD8DBQFEbktUvJuQZxSWSsgRAiCtAJ9nlpqGxlYnimDPp8t5XQsc8y9RywCfZZL6
iU9iPnxHaWOvYCUD7+rK8Do=
=zo3T
-----END PGP SIGNATURE-----