Re: String Similarity

From:	"Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To:	"Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: String Similarity
Date:	2006-05-20 11:29:28
Message-ID:	18825.24.91.171.78.1148124568.squirrel@mail.mohawksoft.com
Lists:	pgsql-hackers

> Get pg_trgm http://www.sai.msu.su/~megera/oddmuse/index.cgi/ReadmeTrgm
> It doesn't depends on language.

That's an interesting approach.

This is what I got:

apps$ ./stratest "pink floyd dark side of the moon money" "dark side of
the moon pink floyd"
Match: dark side of the moon
Match: pink floyd
Similarity: 89

One function finds the substring runs, in descending order of length,
between the two strings. After the function returns, I have the number of
runs, the length of the best run, and the total number of characters matched.
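
A minimal sketch of the kind of run-finding function described above, not the actual stratest code (which is not shown in this message). It assumes a greedy approach: find the longest common substring, record it, blank it out in both strings, and repeat. The names `longest_common_run`, `find_runs`, and the `min_len` cutoff are hypothetical, and the inputs are assumed not to contain the `\x00` or `\x01` sentinel characters.

```python
def longest_common_run(a, b):
    """Return (start_a, start_b, length) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def find_runs(a, b, min_len=3):
    """Greedily collect common runs, longest first, without reusing characters.

    Returns (runs, best_length, total_matched). min_len is an assumed cutoff
    so that stray one- or two-character matches are ignored.
    """
    runs = []
    while True:
        i, j, n = longest_common_run(a, b)
        if n < min_len:
            break
        runs.append(a[i:i + n])
        # Blank out the matched spans with different sentinels in each string
        # so the same characters cannot match a second time.
        a = a[:i] + "\x00" * n + a[i + n:]
        b = b[:j] + "\x01" * n + b[j + n:]
    best_length = len(runs[0]) if runs else 0
    total_matched = sum(len(r) for r in runs)
    return runs, best_length, total_matched


runs, best_length, total_matched = find_runs(
    "pink floyd dark side of the moon money",
    "dark side of the moon pink floyd",
)
for r in runs:
    print("Match:", r.strip())
```

Each pass of this sketch costs on the order of strlen1*strlen2 operations, so it only illustrates the bookkeeping (number of runs, best run, total matched); it does not reduce the cost.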

Without going into too lengthy a description, white space and punctuation
are not reliable. For example: "pinkfloyd" or "pink floyd", "darkside" or
"dark side".
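
One possible way to sidestep unreliable white space and punctuation, offered as an assumption rather than as what stratest does, is to normalize both strings before comparing them. The helper name `normalize` is hypothetical.

```python
def normalize(s):
    """Lowercase the string and drop everything that is not a letter or digit."""
    return "".join(ch for ch in s.lower() if ch.isalnum())


print(normalize("pink floyd") == normalize("pinkfloyd"))
print(normalize("Dark-Side") == normalize("dark side"))
```

The trade-off is that word boundaries are lost entirely, so unrelated words that happen to run together can start to match.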

Humans are VERY good at seeing these things; computers, pardon, suck.

What I was hoping someone had was a function that could find the substring
runs in something less than a strlen1*strlen2 number of operations and a
numerically sane way of representing the similarity or difference.
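
For comparison, here is a rough sketch of a trigram-style score in the spirit of the pg_trgm suggestion quoted above. It is a simplified stand-in, not pg_trgm's actual algorithm (pg_trgm pads and splits words before extracting trigrams); this version just takes the Jaccard ratio of the two sets of character trigrams. The names `trigrams` and `trigram_similarity` are hypothetical. It does not find substring runs, but building the sets is roughly linear in the string lengths and the result is a number between 0 and 1.

```python
def trigrams(s):
    """Return the set of 3-character windows of s, after collapsing white space."""
    s = " ".join(s.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a, b):
    """Jaccard ratio of the two trigram sets: 0.0 (disjoint) to 1.0 (identical sets)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


score = trigram_similarity(
    "pink floyd dark side of the moon money",
    "dark side of the moon pink floyd",
)
print(round(score, 2))
```

Because the score depends only on which trigrams occur, word order barely matters, which suits the "pink floyd" example but discards the run information described earlier.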

In response to

Re: String Similarity at 2006-05-20 04:30:09 from Oleg Bartunov

Responses

Re: String Similarity at 2006-05-20 13:12:21 from Mark Woodward
