
Re: Can pg_trgm handle non-alphanumeric characters?

From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>,"Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>,"Euler Taveira" <euler(at)timbira(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Can pg_trgm handle non-alphanumeric characters?
Date: 2012-05-10 15:07:59
Message-ID: DD0DD117F67E48E5961C326E5C050A3E@maumau
Lists: pgsql-hackers
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
> "MauMau" <maumau307(at)gmail(dot)com> wrote:
>> For information, what kind of breakage would occur?
>> I imagined removing KEEPONLYALNUM would just accept
>> non-alphanumeric characters and cause no harm to those who use
>> only alphanumeric characters.
> This would break our current usages because of the handling of
> trigrams at the "edges" of groups of qualifying characters.  It
> would make similarity (and distance) values less useful for our
> current name searches using it.  To simulate the effect, I used an
> '8' in place of a comma instead of recompiling with the suggested
> change.
> test=# select show_trgm('smith,john');
>                         show_trgm
> -----------------------------------------------------------
> {"  j","  s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
> (1 row)
> test=# select show_trgm('smith8john');
>                      show_trgm
> -----------------------------------------------------
> {"  s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
> (1 row)
> test=# select similarity('smith,john', 'jon smith');
> similarity
> ------------
>   0.615385
> (1 row)
> test=# select similarity('smith8john', 'jon smith');
> similarity
> ------------
>     0.3125
> (1 row)
> So making the proposed change unconditionally could indeed hurt
> current users of the technique.  On the other hand, if there was
> fine-grained control of this, it might make trigrams useful for
> searching statute cites (using all characters) as well as names
> (using the current character set); so I wouldn't want it to just be
> controlled by a global GUC.

Thanks for your explanation. I haven't fully understood it yet, but I'll 
think over what you've explained. I'll also consider whether the tentative 
measure of removing KEEPONLYALNUM is appropriate for someone who wants to 
use pg_trgm against Japanese text.
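
For reference, the edge effect Kevin describes can be reproduced outside the
server with a small Python sketch. This is only a simplified model, not
pg_trgm's actual C code: the function names mirror the SQL functions for
readability, the keep_only_alnum parameter is a hypothetical stand-in for the
KEEPONLYALNUM compile-time macro, and it assumes that without the macro words
are split on whitespace only. Each word is padded with two spaces in front
and one behind, and similarity is the number of shared trigrams divided by
the number of distinct trigrams in either string.

```python
import re


def show_trgm(text, keep_only_alnum=True):
    """Return the set of trigrams for text (simplified model of pg_trgm)."""
    # With keep_only_alnum, any non-alphanumeric character ends a word;
    # otherwise only whitespace does (assumed behaviour without KEEPONLYALNUM).
    pattern = r"[^\W_]+" if keep_only_alnum else r"\S+"
    trigrams = set()
    for word in re.findall(pattern, text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            trigrams.add(padded[i:i + 3])
    return trigrams


def similarity(a, b, keep_only_alnum=True):
    """Shared trigrams divided by distinct trigrams across both strings."""
    ta = show_trgm(a, keep_only_alnum)
    tb = show_trgm(b, keep_only_alnum)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(show_trgm("smith,john")))
    print(round(similarity("smith,john", "jon smith"), 6))   # 0.615385
    print(similarity("smith8john", "jon smith"))             # 0.3125
```

With the comma treated as a word boundary, 'smith,john' and 'jon smith' share
all six trigrams of "smith" plus the two leading trigrams of the second word
(8 of 13 distinct trigrams). When the separator is kept inside the word, the
padded edges in the middle disappear and only 5 of 16 distinct trigrams are
shared, which matches the two similarity values in the quoted psql output.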

