Consider Spaces in pg_trgm for Better Similarity

From:	"Igal (at) Lucee(dot)org" <igal(at)lucee(dot)org>
To:	pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject:	Consider Spaces in pg_trgm for Better Similarity
Date:	2018-01-29 05:56:26
Message-ID:	fb93cee1-6020-1b8c-4dad-e7f9741db497@lucee.org
Lists:	pgsql-general

Is there a way to consider white space in tri-grams? That would allow
for better matches of phrases.

For example, "one two three" and "three two one" currently generate the
same set of tri-grams ({"  o","  t"," on"," th"," tw","ee ","hre","ne ",
"one","ree","thr","two","wo "}), so the distance from "one two four" is
the same for both of them. The query:

SELECT   phrase
        ,input
        ,similarity(t1.phrase, t2.input)
        ,word_similarity(t1.phrase, t2.input)
FROM     (values('one two three'),('three two one')) t1(phrase)
        ,(values('one two four')) t2(input);

Returns:

phrase        |input        |similarity  |word_similarity |
--------------|-------------|------------|----------------|
one two three |one two four |0.444444448 |0.615384638     |
three two one |one two four |0.444444448 |0.615384638     |

But surely "one two four" is more similar to "one two three" than to
"three two one".

Any thoughts?
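
To make the point concrete, here is a rough Python sketch. It is not
pg_trgm's actual code: word_trigrams only approximates the documented
behaviour (lowercase, split into alphanumeric words, pad each word with
two leading spaces and one trailing space), and phrase_trigrams is a
hypothetical variant of my own that pads the whole phrase once, so the
spaces between words end up inside the tri-grams.

```python
import re


def _words(text):
    # Assumption: ASCII-only approximation of pg_trgm's word splitting.
    return re.findall(r"[a-z0-9]+", text.lower())


def _grams(padded):
    # All distinct 3-character windows of the padded string.
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def word_trigrams(text):
    """Approximate pg_trgm: each word is padded and split on its own."""
    grams = set()
    for word in _words(text):
        grams |= _grams("  " + word + " ")
    return grams


def phrase_trigrams(text):
    """Hypothetical variant: pad the phrase once, keep the inner spaces."""
    return _grams("  " + " ".join(_words(text)) + " ")


def similarity(a, b, extract):
    """Shared tri-grams divided by the union of both tri-gram sets."""
    ta, tb = extract(a), extract(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    target = "one two four"
    for phrase in ("one two three", "three two one"):
        per_word = similarity(phrase, target, word_trigrams)
        per_phrase = similarity(phrase, target, phrase_trigrams)
        print(f"{phrase}: per-word {per_word:.3f}, per-phrase {per_phrase:.3f}")
    # one two three: per-word 0.444, per-phrase 0.421
    # three two one: per-word 0.444, per-phrase 0.350
```

With per-word tri-grams both phrases tie at 8/18, which matches the
similarity column above. With the hypothetical per-phrase tri-grams,
"one two three" scores 8/19 and "three two one" scores 7/20, so word
order starts to matter.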

Igal Sapir
Lucee Core Developer
Lucee.org <http://lucee.org/>
