Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Rod Taylor <pg(at)rbt(dot)ca>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Large Scale Aggregation (HashAgg Enhancement)
Date: 2006-01-17 21:43:09
Message-ID: 1137534189.3180.288.camel@localhost.localdomain
Lists: pgsql-hackers

On Tue, 2006-01-17 at 14:41 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
> >> The tricky part is to preserve the existing guarantee that tuples are
> >> merged into their aggregate in arrival order.
>
> > You almost had me there... but there isn't any "arrival order".
>
> The fact that it's not in the spec doesn't mean we don't support it.
> Here are a couple of threads on the subject:
> http://archives.postgresql.org/pgsql-general/2005-11/msg00304.php
> http://archives.postgresql.org/pgsql-sql/2003-06/msg00135.php
>
> Per the second message, this has worked since 7.4, and it was requested
> fairly often before that.

OK.... My interest was in expanding the role of HashAgg, which as Rod
says can be used to avoid the sort, so the overlap between those ideas
was low anyway.

On Tue, 2006-01-17 at 09:52 -0500, Tom Lane wrote:
> I was thinking along the lines of having multiple temp files per hash
> bucket. If you have a tuple that needs to migrate from bucket M to
> bucket N, you know that it arrived before every tuple that was assigned
> to bucket N originally, so put such tuples into a separate temp file
> and process them before the main bucket-N temp file. This might get a
> little tricky to manage after multiple hash resizings, but in principle
> it seems doable.

OK, so we do need to do this when we have a defined arrival order: this
idea is the best one so far. I don't see any optimization to be gained by
ignoring the arrival order, so it seems best to preserve the ordering
this way in all cases.
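
To illustrate the quoted idea, here is a minimal in-memory sketch, not
executor code. It assumes the number of batches doubles on each resize,
uses key % nbatch as a stand-in for the hash, and models each temp file as
a Python list; the function names (batch_of, new_batches, add_tuple,
resize, read_batch) are hypothetical. Migrated tuples are placed in their
own file ahead of the target batch's main file, so reading the files in
order returns each key's tuples in arrival order.

```python
def batch_of(key, nbatch):
    # Stand-in for hash(key) % nbatch; keys are small ints in this sketch.
    return key % nbatch

def new_batches(nbatch):
    # Each batch holds a list of "files" in processing order.
    # The last file in the list is the main file that receives new arrivals.
    return {b: [[]] for b in range(nbatch)}

def add_tuple(batches, nbatch, key, value):
    # New arrivals always go to the end of the batch's main file.
    batches[batch_of(key, nbatch)][-1].append((key, value))

def resize(batches, old_n):
    # Assumption: the batch count doubles, so a tuple in batch b either
    # stays in b or migrates to b + old_n.
    new_n = old_n * 2
    result = {b: [] for b in range(new_n)}
    for b in range(old_n):
        for f in batches[b]:
            stay = [t for t in f if batch_of(t[0], new_n) == b]
            move = [t for t in f if batch_of(t[0], new_n) != b]
            if stay:
                result[b].append(stay)
            if move:
                # Migrated tuples get their own file, queued ahead of
                # anything that arrives in the target batch later.
                result[b + old_n].append(move)
    for b in range(new_n):
        result[b].append([])  # fresh main file for later arrivals
    return result, new_n

def read_batch(batches, b):
    # Process the batch's files in order: migrated files first, main last.
    return [t for f in batches[b] for t in f]

n = 2
batches = new_batches(n)
for k, v in [(1, "a"), (3, "b"), (1, "c")]:
    add_tuple(batches, n, k, v)      # all three land in batch 1
batches, n = resize(batches, n)      # key 3 migrates from batch 1 to batch 3
for k, v in [(3, "d"), (1, "e"), (3, "f")]:
    add_tuple(batches, n, k, v)
print(read_batch(batches, 3))        # [(3, 'b'), (3, 'd'), (3, 'f')]
```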

You can manage that with file naming. Rows moved from batch N to batch M
would be renamed N.M, so you'd be able to use file ordering to retrieve
all files for *.M.
That scheme would work for multiple splits too, so that filenames could
grow yet retain their sort order and final target batch properties.
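
A rough sketch of how such names could be built and read back, for
illustration only. The helper names (migrated_name, target_batch,
files_for_batch) are hypothetical, and the ordering rule is an assumption
of this sketch rather than something spelled out above: among files with
the same final target, the one with the longer migration history is taken
to hold the earlier arrivals and is processed first.

```python
def migrated_name(name, target):
    # Rows leaving the file `name` for batch `target` get the target appended,
    # e.g. rows moved out of "1.3" into batch 7 go to "1.3.7".
    return f"{name}.{target}"

def target_batch(name):
    # The final component of the name is the batch the rows now belong to.
    return int(name.rsplit(".", 1)[-1])

def files_for_batch(names, target):
    # Equivalent of collecting "*.M" plus the main file "M" itself.
    matches = [n for n in names if target_batch(n) == target]
    # Assumption: longer history means earlier arrival, so it sorts first;
    # the name itself is only a tie-breaker.
    return sorted(matches, key=lambda n: (-n.count("."), n))

names = ["1", "3", "1.3", "5", "1.5", "7", "3.7", "1.3.7"]
print(files_for_batch(names, 7))   # ['1.3.7', '3.7', '7']
print(files_for_batch(names, 3))   # ['1.3', '3']
```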

Best Regards, Simon Riggs
