Quick Links

Re: Unaccent performance

From:	Thom Brown <thom(at)linux(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Unaccent performance
Date:	2013-06-21 21:52:04
Message-ID:	CAA-aLv7MJCp-Lmn7WCW8LrQonvaNb-6jQqjxdSxbmJYzKuqb+w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 21 June 2013 19:04, Thom Brown <thom(at)linux(dot)com> wrote:

> Hi,
>
> The unaccent extension is great, especially with its customisability, but
> it's not always easy to recommend. I witnessed a customer using no less
> than 56 nested replace functions in an SQL function. I looked to see how
> much this can be mitigated by unaccent. It turns out that not all the
> characters they were replacing can be replaced by unaccent, either because
> they replace more than 1 character at a time, or the character they're
> replacing, for some reason, isn't processed by unaccent, even with a custom
> rules file.
>
> So there were 20 characters I could identify that they were replacing. I
> made a custom rules file and compared its performance to the
> difficult-to-manage set of nested replace calls.
>
> CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
> RETURNS text
> LANGUAGE sql
> IMMUTABLE
> AS $function$
> SELECT
> replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
> ;
> $function$
>
> postgres=# SELECT myunaccent(sometext::text) FROM (SELECT
> 'ÀÁÂÃÄÅÆàáâãäåæĀāĂăĄą' sometext FROM generate_series(1,1000000)) x OFFSET
> 999999 LIMIT 1;
> myunaccent
> ----------------------
> AAAAAAAaaaaaaaAaAaAa
> (1 row)
>
> Time: 726.282 ms
> postgres=# SELECT unaccent(sometext::text) FROM (SELECT
> 'ÀÁÂÃÄÅÆàáâãäåæĀāĂăĄą' sometext FROM generate_series(1,1000000)) x OFFSET
> 999999 LIMIT 1;
> unaccent
> ----------------------
> AAAAAAAaaaaaaaAaAaAa
> (1 row)
>
> Time: 3305.252 ms
>
> The timings are actually pretty much the same even if I introduce 187
> nested replace calls for every line in the unaccent.rules file for 187
> characters. But the same character set with unaccent increases to 7418.526
> ms with the same type of query as above. That's 10 times more expensive.
>
> Is there a way to boost the performance to make its adoption more
> palatable?
>

Another test passing in a string of 100000 characters gives the following
timings:

unaccent: 240619.395 ms
myunaccent: 785.505 ms

I guess this must indicate that unaccent is processing all rows, and
myunaccent is only being run on the 1 select row? I can't account for
myunaccent always being almost the same duration regardless of string
length otherwise. This is probably an incorrect assessment of performance.

Another test inserting long text strings into a text column of a table
100,000 times, then updating another column to have that unaccented value
using both methods:

unaccent: 3867.306 ms
myunaccent: 43611.732 ms

So I guess this complaint about performance is all just noise.

However, pushing that pointless complaint to one side, I would like to have
the ability to have unaccent support more characters that it doesn't
currently seem to support, such as bullet points, ellipses etc., and also
more than 1 character being replaced. Naturally these aren't appropriate
to fall under the unaccent function itself, but the rules file is good
starting point. It would be a bit like translate, except it would use a
rules file instead of providing strings of single characters to convert.

So say we wanted "(trademark)" to be converted into "™" just as an example,
or ";" to ".". We can't do that with unaccent, but in order to avoid a
huge list of replace functions, a function like unaccent, with a few
adaptations, would solve the problem.

e.g.:

SELECT transform(my_custom_dictionary, 'Commodore Amiga(trademark);')

would return

Commodore Amiga™.

This would ideally somehow cater for replacing tabs and spaces too.

--
Thom

In response to

Unaccent performance at 2013-06-21 18:04:04 from Thom Brown

Responses

Re: Unaccent performance at 2013-06-22 13:56:30 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Troels Nielsen	2013-06-21 22:29:03	Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
Previous Message	Tom Lane	2013-06-21 21:44:21	Re: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)