O(n^2) aggregates

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: O(n^2) aggregates
Date: 2007-12-10 15:05:04
Message-ID: 87tzmqfv4v.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


I was trying to test my patch to do posix_fadvise to speed up bitmap heap
scans (with disappointing results so far) and ran into a bit of a gotcha. I'm
not sure where this should be documented but it probably should be somewhere.

In order to test bitmap heap scan I had to build an array and use the = ANY(a)
form. (The natural approach of building a table of values to search for
produces a hash join or nested loop -- it may make sense to add a new kind of
join which uses an outer bitmap heap scan.)

So I defined an aggregate to group up the random values in an array in the
usual fashion:

CREATE AGGREGATE arrayize(anyelement) (
SFUNC = array_append,
STYPE = anyarray,
INITCOND = '{}'
);

and ran queries like:

select count(*)
from huge
where h = any ((select arrayize( (1+random()*300000000)::integer )
from generate_series(1,1000)
)::integer[])

To test the behaviour for larger and larger samples I bumped the "1000" up
further and further and noticed a larger and large pause in disk access. When
I tried 40000 the query took over 47 minutes nearly all of which had one cpu
pegged at 100%.

What's going on is that arrayize() is actually a O(n^2) algorithm since each
transition requires creating a copy of the entire array.

The solution to this would analogous to what we did to count(). We would need
to add a field to ArrayMetaState which is stored in fn_extra to remember the
last array returned. Then if array_push notices it has been called from an
aggregate context it can store its result in there. The next time it would
extend that array in place (which is code which doesn't currently exist),
possibly repallocing it and return the same pointer.

It's a bit of a hack but I think this is going to be a pretty common use case
and I don't see any more general solution.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2007-12-10 15:11:44 Re: [HACKERS] BUG #3799: csvlog skips some logs
Previous Message Andrew Dunstan 2007-12-10 15:04:46 Re: [HACKERS] BUG #3799: csvlog skips some logs