Re: Large Scale Aggregation (HashAgg Enhancement)

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Rod Taylor <pg(at)rbt(dot)ca>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Large Scale Aggregation (HashAgg Enhancement)
Date: 2006-01-17 05:01:38
Message-ID: 87y81fwiq5.fsf@stark.xeocode.com
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> > Why would we continue to dynamically build the hash table after the
> > start of the outer scan?
>
> The number of tuples written to a temp file might exceed what we want to
> hold in memory; we won't detect this until the batch is read back in,
> and in that case we have to split the batch at that time. For hashagg
> this point would apply to the aggregate states not the input tuples, but
> it's still a live problem (especially if the aggregate states aren't
> fixed-size values ... consider a "concat" aggregate for instance).

For a hash aggregate, would it be possible to rescan the original table
instead of spilling to temporary files? When you run out of working memory
you simply throw out half the hash table and ignore any subsequent tuples
that fall in the discarded hash buckets, then rescan the table for the
discarded hash bucket regions.
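
Very roughly, and with made-up names rather than anything that exists in
the executor today, I'm picturing something like this (a sketch only; it
ignores the degenerate case of a single group whose state alone exceeds
work_mem):

    /* Aggregate by repeated scans instead of spilling.  Each pass handles
     * the groups whose hash falls in [lo, keep_hi]; on overflow we halve
     * the kept range, discard those groups, and rescan for them later. */
    static void
    hashagg_by_rescan(Scan *scan, HashTable *ht)
    {
        uint32  lo = 0;             /* lowest hash value not yet emitted */
        bool    done = false;

        while (!done)
        {
            uint32  keep_hi = UINT32_MAX;   /* shrinks on overflow */
            Tuple  *tup;

            rescan(scan);
            hashtable_reset(ht);

            while ((tup = next_tuple(scan)) != NULL)
            {
                uint32  h = hash_group_key(tup);

                if (h < lo)
                    continue;       /* emitted in an earlier pass */

                while (h <= keep_hi && !hashtable_advance(ht, tup))
                {
                    /* out of memory: throw out the upper half of the
                     * range we were still keeping */
                    keep_hi = lo + (keep_hi - lo) / 2;
                    hashtable_discard_above(ht, keep_hi);
                }
                /* tuples with h > keep_hi are simply ignored this pass */
            }

            emit_finished_groups(ht);

            if (keep_hi == UINT32_MAX)
                done = true;        /* nothing was discarded */
            else
                lo = keep_hi + 1;   /* rescan for the discarded region */
        }
    }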

This avoids having to do any disk writes, possibly at the expense of
additional reads. I think in terms of i/o it would be much faster in most
cases.

The downsides are: a) volatile aggregates or aggregates with side effects
would be confused by being executed twice. I'm not sure volatile aggregate
functions make any sense anyway, though. b) I'm unclear whether rescanning
the table could find tuples in a different state than the previous scans
did. If so, the idea doesn't work at all. But I don't think that's
possible, is it?

The main problem is c) it may lose in terms of i/o when the cardinality is
low (i.e., it's overflowing despite the low cardinality because the table
is really, really big too). But most cases will be spilling because the
cardinality is high. In that case the table may be big, but the spill files
are nearly as big anyway, and having to write and then read them means
double the i/o.
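
To put some rough, made-up numbers on it: say the table is 10 GB. In the
high-cardinality case the spill files are close to 10 GB too, so spilling
costs roughly 10 GB of table reads plus 10 GB of writes plus 10 GB of
re-reads, about 30 GB of i/o, while two passes over the table are 20 GB of
reads and no writes at all. But if the table is 10 GB and the aggregate
states only overflow work_mem by, say, 100 MB, spilling costs a little over
10 GB total while every extra pass over the table costs another 10 GB of
reads, and rescanning clearly loses.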

The upside of not having to write out temporary files is big. I find that
queries which require temporary sort files get hit with a *huge*
performance penalty, often an order of magnitude. Part of that could
probably be mitigated by putting the sort files on a separate spindle, but
I think it's always going to hurt, especially if multiple operations are
spilling to disk simultaneously.

--
greg
