Re: Is there a similarity-function that minds national charsets?

From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Andreas <maps(dot)on(at)gmx(dot)net>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Is there a similarity-function that minds national charsets?
Date: 2012-06-21 02:53:00
Lists: pgsql-sql
On 06/21/2012 12:30 AM, Andreas wrote:
> Hi,
> Is there a similarity-function that minds national charsets?
> Over here we've got some special cases that screw up the results on 
> similarity().
> Our characters: ä, ö, ü, ß
> could as well be written as:  ae, oe, ue, ss
> e.g.
> select similarity ( 'Müller', 'Mueller' )
> results in:  0.363636
> In normal cases everything below 0.5 would be too far apart to be 
> considered a match.

That's not just charset aware, that's looking for awareness of 
language-and-dialect specific transliteration rules for representing 
accented chars in 7-bit ASCII. My understanding was that these rules and 
conventions vary and are specific to each language - or even region.
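
For context, the 0.363636 quoted above can be reproduced by hand, assuming similarity() here is the one from the pg_trgm module. As far as I know, pg_trgm lowercases each word, pads it with two leading spaces and one trailing space, and scores the number of shared trigrams over the number of distinct trigrams in either string. 'müller' then yields 7 trigrams and 'mueller' yields 8; only 4 are shared ("  m", "lle", "ler", "er "), so the score is 4 / (7 + 8 - 4) = 4 / 11. A rough sketch to check this on your own install:

```sql
-- Assumes the pg_trgm extension is available in this database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Lists the padded trigrams pg_trgm extracts from one string.
SELECT show_trgm('Mueller');

-- The score from the original question.
SELECT similarity('Müller', 'Mueller');

-- The same ratio worked by hand: shared / (left + right - shared).
SELECT 4.0 / (7 + 8 - 4) AS by_hand;
```

So the umlaut costs three trigrams on one side and four on the other, which is why a single transliterated character drags the score below the usual cutoff.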

tsearch2 has big language dictionaries to try to handle some issues like 
this (though I don't know about this issue specifically). It's possible 
you could extend the tsearch2 dictionaries with synonyms, possibly 
algorithmically generated.

If you have what you consider to be an acceptable 1:1 translation rule 
you could build a functional index on it and test against that, eg:

CREATE INDEX blah ON thetable ( flatten_accent(target_column) );
SELECT similarity( flatten_accent('Müller'), flatten_accent(target_column) ) FROM thetable;

Note that the flatten_accent function must be IMMUTABLE and can't access 
or refer to data in other tables, columns, etc nor SET (GUC) variables 
that might change at runtime.
Craig Ringer
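
A minimal sketch of what such a function might look like, covering only the four German substitutions from the original question. The body is an assumption for illustration, not a general transliteration scheme, and the index name is made up. It also assumes pg_trgm is installed: a plain btree index will not speed up similarity(), whereas a trigram GiST index can serve the % operator, which matches rows above the pg_trgm similarity threshold (0.3 by default).

```sql
-- Hypothetical helper: German-style 1:1 transliteration only.
-- IMMUTABLE is what allows it to be used in an expression index.
CREATE OR REPLACE FUNCTION flatten_accent(text)
RETURNS text
LANGUAGE sql
IMMUTABLE STRICT
AS $$
    SELECT replace(replace(replace(replace(
           replace(replace(replace($1,
               'ä', 'ae'), 'ö', 'oe'), 'ü', 'ue'), 'ß', 'ss'),
               'Ä', 'Ae'), 'Ö', 'Oe'), 'Ü', 'Ue');
$$;

-- Assumes the pg_trgm extension; the index name is illustrative.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX thetable_flat_trgm_idx
    ON thetable USING gist ((flatten_accent(target_column)) gist_trgm_ops);

-- Flatten both sides so 'Müller' and 'Mueller' compare as the same string.
SELECT target_column,
       similarity(flatten_accent(target_column), flatten_accent('Müller')) AS sim
FROM thetable
WHERE flatten_accent(target_column) % flatten_accent('Müller')
ORDER BY sim DESC;
```

With both sides flattened, 'Müller' and 'Mueller' reduce to the same text, so they score as an exact match instead of 0.363636. The trade-off is that genuine "ue" spellings become indistinguishable from "ü", which may or may not be acceptable for your data.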
