From: | Thom Brown <thom(at)linux(dot)com> |
---|---|
To: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Unaccent performance |
Date: | 2013-06-21 21:52:04 |
Message-ID: | CAA-aLv7MJCp-Lmn7WCW8LrQonvaNb-6jQqjxdSxbmJYzKuqb+w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 21 June 2013 19:04, Thom Brown <thom(at)linux(dot)com> wrote:
> Hi,
>
> The unaccent extension is great, especially with its customisability, but
> it's not always easy to recommend. I witnessed a customer using no less
> than 56 nested replace functions in an SQL function. I looked to see how
> much this can be mitigated by unaccent. It turns out that not all the
> characters they were replacing can be replaced by unaccent, either because
> they replace more than 1 character at a time, or the character they're
> replacing, for some reason, isn't processed by unaccent, even with a custom
> rules file.
>
> So there were 20 characters I could identify that they were replacing. I
> made a custom rules file and compared its performance to the
> difficult-to-manage set of nested replace calls.
>
> CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
> RETURNS text
> LANGUAGE sql
> IMMUTABLE
> AS $function$
> SELECT
> replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
> ;
> $function$
>
> postgres=# SELECT myunaccent(sometext::text) FROM (SELECT
> 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET
> 999999 LIMIT 1;
> myunaccent
> ----------------------
> AAAAAAAaaaaaaaAaAaAa
> (1 row)
>
> Time: 726.282 ms
> postgres=# SELECT unaccent(sometext::text) FROM (SELECT
> 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET
> 999999 LIMIT 1;
> unaccent
> ----------------------
> AAAAAAAaaaaaaaAaAaAa
> (1 row)
>
> Time: 3305.252 ms
>
> The timings are actually pretty much the same even if I introduce 187
> nested replace calls for every line in the unaccent.rules file for 187
> characters. But the same character set with unaccent increases to 7418.526
> ms with the same type of query as above. That's 10 times more expensive.
>
> Is there a way to boost the performance to make its adoption more
> palatable?
>
Another test passing in a string of 100000 characters gives the following
timings:
unaccent: 240619.395 ms
myunaccent: 785.505 ms
I guess this must indicate that unaccent is processing all rows, and
myunaccent is only being run on the 1 select row? I can't account for
myunaccent always being almost the same duration regardless of string
length otherwise. This is probably an incorrect assessment of performance.
Another test inserting long text strings into a text column of a table
100,000 times, then updating another column to have that unaccented value
using both methods:
unaccent: 3867.306 ms
myunaccent: 43611.732 ms
So I guess this complaint about performance is all just noise.
However, pushing that pointless complaint to one side, I would like to have
the ability to have unaccent support more characters that it doesn't
currently seem to support, such as bullet points, ellipses etc., and also
more than 1 character being replaced. Naturally these aren't appropriate
to fall under the unaccent function itself, but the rules file is good
starting point. It would be a bit like translate, except it would use a
rules file instead of providing strings of single characters to convert.
So say we wanted "(trademark)" to be converted into "™" just as an example,
or ";" to ".". We can't do that with unaccent, but in order to avoid a
huge list of replace functions, a function like unaccent, with a few
adaptations, would solve the problem.
e.g.:
SELECT transform(my_custom_dictionary, 'Commodore Amiga(trademark);')
would return
Commodore Amiga™.
This would ideally somehow cater for replacing tabs and spaces too.
--
Thom
From | Date | Subject | |
---|---|---|---|
Next Message | Troels Nielsen | 2013-06-21 22:29:03 | Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls |
Previous Message | Tom Lane | 2013-06-21 21:44:21 | Re: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks) |