Re: loop vs. aggregate was: update and group by/aggregate

From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: loop vs. aggregate was: update and group by/aggregate
Date: 2008-08-27 11:19:30
Message-ID: 20080827131930.095d66f7@dawn.webthatworks.it
Lists: pgsql-general

On Wed, 27 Aug 2008 10:32:43 +0200
Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> wrote:

> create or replace function UpdateAuthors()
> returns void
> as
> $$
> declare
>     _row record;
>     _ItemID bigint;
>     _Authors varchar(1024);
>     _Name varchar(50);
> begin
>     _Authors:='';
>     _ItemID:=null;
>     for _row in select a.Name, ia.ItemID from {catalog_itemauthor} ia
>             join {catalog_author} a on a.AuthorID=ia.AuthorID
>             order by ia.ItemID
>     loop
>         -- rows arrive ordered by ItemID; on each change of ItemID,
>         -- store the list accumulated for the previous item
>         if(_row.ItemID<>_ItemID) then
>             if(length(_Authors)>2) then
>                 -- strip the leading ', ' before storing
>                 _Authors:=substring(_Authors from 3);
>                 update {catalog_items} set Authors=_Authors
>                     where ItemID=_ItemID;
>             end if;
>             _Authors:='';
>         end if;
>         _ItemID:=_row.ItemID;
>         _Name:=trim(' \t' from _row.Name);
>         if(length(_Name)>0) then
>             _Authors:=_Authors || ', ' || _Name;
>         end if;
>     end loop;
>     -- the loop never flushes the last item, so store it here
>     if(_ItemID is not null and length(_Authors)>2) then
>         _Authors:=substring(_Authors from 3);
>         update {catalog_items} set Authors=_Authors
>             where ItemID=_ItemID;
>     end if;
>     return;
> end;
> $$ language plpgsql volatile;

Replying to myself:
update catalog_items set Authors=s.Authors
from (
    select ia.ItemID,
        array_to_string(array_accum(trim(' \t]' from a.Name)), ', ')
            as Authors
    from catalog_itemauthor ia
    join catalog_author a on a.AuthorID=ia.AuthorID
    where a.Name is not null
        and length(trim(' \t' from a.Name))>1
    group by ia.ItemID
) as s
where s.ItemID=catalog_items.ItemID;
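
A note for anyone trying this: array_accum is not a built-in
aggregate here; it is the example aggregate from the PostgreSQL
manual, i.e. something like

  create aggregate array_accum (anyelement)
  (
      sfunc = array_append,
      stype = anyarray,
      initcond = '{}'
  );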

but this looks much slower than the function:
function: 113sec
vs.
single statement: 488sec
I repeated the test 3 times with similar results.
Can anybody explain why the aggregate version performs so badly?
I just read that most of the time I can't be smarter than the
planner, and I thought this would be one of the circumstances where
the planner could outperform my handwritten function.
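
(For anyone who wants to reproduce the comparison: psql's \timing
should be enough, e.g.

  \timing on
  select UpdateAuthors();
  update catalog_items set Authors=s.Authors
      from ( ...same subquery as above... ) as s
      where s.ItemID=catalog_items.ItemID;

running each a few times so the cache is warm.)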

here is the explain:
"Hash Join (cost=137408.51..271599.78 rows=209674 width=221)"
" Hash Cond: ("outer".itemid = "inner".itemid)"
" -> HashAggregate (cost=32994.81..36664.11 rows=209674 width=58)"
" -> Hash Join (cost=8544.62..31946.44 rows=209674 width=58)"
" Hash Cond: ("outer".authorid = "inner".authorid)"
" -> Seq Scan on catalog_itemauthor ia (cost=0.00..10297.21 rows=629021 width=12)"
" -> Hash (cost=8309.00..8309.00 rows=94248 width=54)"
" -> Seq Scan on catalog_author a (cost=0.00..8309.00 rows=94248 width=54)"
" Filter: ((name IS NOT NULL) AND (length(btrim((name)::text, E' \011'::text)) > 1))"
" -> Hash (cost=79538.96..79538.96 rows=833496 width=189)"
" -> Seq Scan on catalog_items (cost=0.00..79538.96 rows=833496 width=189)"

Thanks to Pavel, who suggested the use of array_to_string.

Sam... I made the same mistake the first time I wrote the above
statement (a missing where clause), but I can't understand the need
for your longer version "to ensure that characters trimmed from the
authors' name are consistent".
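The only difference I can spot is that my two trim sets don't match:
the where clause strips ' \t' while the select list strips ' \t]'.
So a made-up name like ']]' would pass the filter and still collapse
to an empty string:

  select length(trim(' \t' from ']]')) > 1 as passes_where, -- true
         trim(' \t]' from ']]') as trimmed_name;            -- ''

and an empty entry could then sneak into the aggregated list. Is
that the inconsistency you meant?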

I prefer to put a constraint on the length of a varchar as an early
warning for something unexpected.
e.g. if the loop were not working as expected I'd get an error. In
fact that's exactly what happened during development of the above
function.
If Authors were a text field I'd be waiting forever, filling the DB
with rubbish, etc...
No field is really unlimited; there are just fields whose upper
limit is hard to guess.
1024 is actually an inherited choice... but I did some research to
find a reasonable upper limit (highest number of authors * longest
author name).
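
(The early warning is just the type check itself. A made-up example:

  create temp table t (Authors varchar(8));
  insert into t values ('definitely too long');
  -- ERROR:  value too long for type character varying(8)

With a text column the insert would silently succeed and the rubbish
would just keep accumulating.)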

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
