Skip site navigation (1) Skip section navigation (2)

Re: Can pg_trgm handle non-alphanumeric characters?

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "MauMau" <maumau307(at)gmail(dot)com>,"Euler Taveira" <euler(at)timbira(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-10 14:43:08
Message-ID: 4FAB8DAC0200002500047ACD@gw.wicourts.gov (view raw or flat)
Thread:
Lists: pgsql-hackers
"MauMau" <maumau307(at)gmail(dot)com> wrote:
 
>>> On 09-05-2012 19:17, MauMau wrote:
>>>> Then, does it make sense to remove "#define KEEPONLYALNUM" in
>>>> 9.1.4? Would it cause any problems?
 
Yes, it will cause problems.
 
> For information, what kind of breakage would occur?
 
> I imagined removing KEEPONLYALNUM would just accept
> non-alphanumeric characters and cause no harm to those who use
> only alphanumeric characters.
 
This would break our current usages because of the handling of
trigrams at the "edges" of groups of qualifying characters.  It
would make similarity (and distance) values less useful for our
current name searches using it.  To simulate the effect, I used an
'8' in place of a comma instead of recompiling with the suggested
change.

test=# select show_trgm('smith,john');
                         show_trgm                         
-----------------------------------------------------------
 {"  j","  s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
(1 row)

test=# select show_trgm('smith8john');
                      show_trgm                      
-----------------------------------------------------
 {"  s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
(1 row)

test=# select similarity('smith,john', 'jon smith');
 similarity 
------------
   0.615385
(1 row)

test=# select similarity('smith8john', 'jon smith');
 similarity 
------------
     0.3125
(1 row)
 
So making the proposed change unconditionally could indeed hurt
current users of the technique.  On the other hand, if there was
fine-grained control of this, it might make trigrams useful for
searching statute cites (using all characters) as well as names
(using the current character set); so I wouldn't want it to just be
controlled by a global GUC.
 
-Kevin

In response to

Responses

pgsql-hackers by date

Next:From: Magnus HaganderDate: 2012-05-10 14:43:37
Subject: Re: incorrect handling of the timeout in pg_receivexlog
Previous:From: Robert HaasDate: 2012-05-10 14:42:28
Subject: Re: Draft release notes complete

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group