dmetaphone woes

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: dmetaphone woes
Date: 2010-04-05 01:42:23
Lists: pgsql-hackers
While testing pgindent the other day, I found some infelicities in 
contrib/fuzzystrmatch/dmetaphone.c. From pgindent's point of view, the 
problem is that the code contains two characters in case labels with the 
high bits set, and this blows pgindent up on my Linux box if the locale 
happens to be en_US.utf8 instead of C. Now, we can fix that easily enough 
by replacing those characters with the equivalent hexadecimal escapes.
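
As a rough illustration of that fix (a minimal sketch, not the actual 
dmetaphone.c code: the enum and function names here are hypothetical), the 
case labels can be written with hexadecimal escapes so the source file 
stays pure ASCII:

```c
#include <assert.h>

/* Hypothetical result type for this sketch; not taken from dmetaphone.c. */
typedef enum
{
    CH_OTHER,
    CH_C_CEDILLA,
    CH_N_TILDE
} HighBitClass;

/*
 * Classify one byte of a Latin1-encoded string.  The two high-bit
 * characters are spelled as hex escapes instead of literal bytes, so
 * tools that read the source under a UTF-8 locale see only ASCII.
 */
HighBitClass
classify_latin1_char(char c)
{
    switch (c)
    {
        case '\xc7':            /* U+00C7, C with cedilla, in Latin1 */
            return CH_C_CEDILLA;
        case '\xd1':            /* U+00D1, N with tilde, in Latin1 */
            return CH_N_TILDE;
        default:
            return CH_OTHER;
    }
}

/* Small self-check for the sketch above. */
void
check_classify(void)
{
    assert(classify_latin1_char('\xc7') == CH_C_CEDILLA);
    assert(classify_latin1_char('\xd1') == CH_N_TILDE);
    assert(classify_latin1_char('C') == CH_OTHER);
}
```

Switching on a plain char and using character-constant labels keeps the 
comparison consistent whether char is signed or unsigned on the platform.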

However, that doesn't solve the fundamental problem, which is that the 
code in question is pretty much broken for any encoding but Latin1. (In 
my defence I plead that when I created the module, by porting code from 
a perl module, I was working with pure ASCII data and was much more 
ignorant than I am now about encoding issues.) The rest of the code 
deals in pure ASCII characters, and so it should be safe, I think.
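
To make the encoding problem concrete, here is a small sketch (the 
function name is made up for illustration, and this is not how the module 
is actually structured) of what a byte-wise test for 0xC7 does with 
Latin1 versus UTF-8 input. In UTF-8, U+00C7 is the two bytes 0xC3 0x87, 
so the test misses it, while the byte 0xC7 is itself the lead byte of 
other characters such as U+01C7:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if any byte of s equals 0xC7, the Latin1 code for
 * C with cedilla.  This mimics a byte-at-a-time comparison; it knows
 * nothing about multibyte encodings.
 */
int
contains_latin1_c_cedilla(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++)
    {
        if ((unsigned char) s[i] == 0xC7)
            return 1;
    }
    return 0;
}

/* Small self-check; literals are split so hex escapes end where intended. */
void
check_encodings(void)
{
    /* Latin1 input: the character is the single byte 0xC7, so it is found. */
    assert(contains_latin1_c_cedilla("GAR" "\xC7" "ON") == 1);

    /* UTF-8 input: U+00C7 is 0xC3 0x87, so the byte test misses it. */
    assert(contains_latin1_c_cedilla("GAR" "\xC3\x87" "ON") == 0);

    /* UTF-8 input: U+01C7 is 0xC7 0x87, so the byte test fires wrongly. */
    assert(contains_latin1_c_cedilla("\xC7\x87") == 1);
}
```

Pure ASCII input is unaffected either way, since ASCII bytes are the same 
in Latin1 and UTF-8.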

I'm not exactly sure why the algorithm treats these two characters 
(U+00C7 and U+00D1, C with a cedilla and N with a tilde respectively) 
specially.

The code has been there for some time, and nobody has bitched about it 
that I know of, so I'm not in a hurry to fix it, unless people think we 
should do that before 9.0. Making the code properly encoding aware would 
probably involve a non-trivial amount of surgery. If not, I'm inclined 
to fix the issue that affects pgindent, and leave the rest as a TODO 
item for 9.1.




