Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Large Scale Aggregation (HashAgg Enhancement)
Date: 2006-01-16 19:02:56
Message-ID: 1137438176.3180.160.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2006-01-16 at 09:42 -0500, Rod Taylor wrote:
> On Mon, 2006-01-16 at 08:32 +0000, Simon Riggs wrote:
> > On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> > >
> > A question: Are the rows in your 3 B row table clumped together based
> > upon the 100M row key? (or *mostly* so) We might also be able to
>
> They are randomly distributed. Fully sorting the data is quite painful.

...

> I don't understand how this helps.

It wouldn't since your rows are randomly distributed. The idea was not
related to improving HashAgg, but to improving Aggregation for the case
of naturally grouped data.

> I think I need something closer to:
>
> HashAgg
> -> HashSort (to disk)
>
> HashSort would create a number of files on disk with "similar" data.
> Grouping all similar keys into a single temporary file which HashAgg can
> deal with individually (100 loops by 1M target keys instead of 1 loop by
> 100M target keys). The results would be the same as partitioning by
> keyblock and running a HashAgg on each partition, but it would be
> handled by the Executor rather than by client side code.
>
> > > I've written something similar using a client and COPY with temporary
> > > tables. Even with the Export/Import copy I still beat the Sort&Sum
> > > method PostgreSQL falls back to.

That is exactly how the spill to disk logic works for HashJoin (and
incidentally, identical to an Oracle one-pass hash join since both are
based upon the hybrid hash join algorithm).

Multi-pass would only be required to handle very skewed hash
distributions, which HJ doesn't do yet.

So yes, this can be done.

Best Regards, Simon Riggs

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2006-01-16 19:19:06 Re: Large Scale Aggregation (HashAgg Enhancement)
Previous Message Jim C. Nasby 2006-01-16 18:52:16 Re: Surrogate keys (Was: enums)