Re: Memory-Bounded Hash Aggregation

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2020-01-29 22:48:56
Message-ID: 6e2b4a4475a690fa515829950aeed935696b0df3.camel@j-davis.com
Lists: pgsql-hackers

On Fri, 2020-01-24 at 17:16 -0800, Peter Geoghegan wrote:
> That sounds weird. Might be pathological in some sense.
>
> I have a wild guess for you. Maybe this has something to do with the
> "test for presorted input" added by commit a3f0b3d68f9. That can
> perform very badly when the input is almost sorted, but has a few
> tuples that are out of order towards the end. (I have called these
> "banana skin tuples" in the past.)

My simple test case is: 'explain analyze select i from big group by
i;', where "big" has 20M tuples.

I tried it without that change, and it helped (bringing the time from
55s down to 45s). But if I remove the sorting of the freelist
completely, it goes down to 12s. So it's something about the access
pattern.

After digging a bit more, I see that, for Sort, the LogicalTapeSet's
freelist hovers around 300 entries and doesn't grow larger than that.
For HashAgg, it gets up to almost 60K. The pattern in HashAgg is that
the space required is at a maximum after the first spill, and after
that point the used space declines with each batch (because the groups
that fit in the hash table were finalized and emitted, and only the
ones that didn't fit were written out). As the amount of required space
declines, the size of the freelist grows.

That leaves a few options:

1) Cap the size of the LogicalTapeSet's freelist. If the freelist is
growing large, that's probably because it will never actually be used.
I'm not quite sure how to pick the cap, though, and it seems a bit
hacky to just leak the freed space. (A rough sketch follows the list.)

2) Use a different structure more capable of handling a large fraction
of free space. A compressed bitmap might make sense, but it seems
wasteful to spend effort tracking a lot of space that is unlikely to
ever be reused.

3) Don't bother tracking free space for HashAgg at all. There's
already an API for that (LogicalTapeSetForgetFreeSpace()), so I don't
need to further hack logtape.c.

4) Try to be clever and shrink the file (or at least the tracked
portion of the file) if the freed blocks are at the end (sketch below).
This wouldn't be very useful at the first recursive level, but the
problem is worst at the later levels anyway. Unfortunately, I think
this requires a breadth-first strategy to make sure that blocks at the
end actually get freed. If I do change it to breadth-first as well,
this amounts to a significant speedup.
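
To make the simplest form of #4 concrete: in ltsReleaseBlock(), a freed
block that happens to be the last one allocated could shrink the
tracked portion of the file instead of going onto the freelist. A
sketch (field names per my reading of logtape.c; the logic is
illustrative, not a tested patch):

    /* In ltsReleaseBlock(), before adding blocknum to the freelist: */
    if (blocknum == lts->nBlocksAllocated - 1)
    {
        /* Freed block is at the end; shrink instead of remembering it. */
        lts->nBlocksAllocated--;

        /*
         * Could also repeatedly pop the freelist while its highest
         * entry is now the last allocated block, shrinking further.
         */
        return;
    }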

I am leaning toward #1 or #3.
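
For #1, the change itself would be tiny. A sketch of what the cap might
look like at the top of ltsReleaseBlock() (the constant is a
placeholder, and picking its actual value is the open question; the
rest of the function is abbreviated):

    /* Hypothetical cap on the number of tracked free blocks. */
    #define MAX_FREE_BLOCKS 1024

    static void
    ltsReleaseBlock(LogicalTapeSet *lts, long blocknum)
    {
        /* Do nothing if we're no longer remembering free space. */
        if (lts->forgetFreeSpace)
            return;

        /*
         * A huge freelist probably means the space will never be
         * reused; just leak the block rather than tracking it.
         */
        if (lts->nFreeBlocks >= MAX_FREE_BLOCKS)
            return;

        /* ... existing logic: append blocknum, mark array unsorted ... */
    }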

As an aside, I'm curious why the freelist is managed the way it is.
Newly released blocks are likely to be higher-numbered (or at least not
the lowest-numbered), but they are appended to the end of an array, so
the array is likely to need repeated re-sorting to get back into
descending order. Wouldn't a min-heap or something make more sense?
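
For what it's worth, a binary min-heap keyed on block number would make
both operations O(log n) with no re-sorting: releasing a block is a
sift-up, and grabbing the lowest-numbered free block is a pop of the
root. A rough sketch (standalone, ignoring array growth; the names are
mine, not logtape.c's):

    /* Add a freed block; sift up until the parent is smaller. */
    static void
    heap_push(long *heap, int *n, long blocknum)
    {
        int     i = (*n)++;

        while (i > 0 && heap[(i - 1) / 2] > blocknum)
        {
            heap[i] = heap[(i - 1) / 2];
            i = (i - 1) / 2;
        }
        heap[i] = blocknum;
    }

    /* Remove and return the lowest block number; sift down from root. */
    static long
    heap_pop(long *heap, int *n)
    {
        long    result = heap[0];
        long    last = heap[--(*n)];
        int     i = 0;

        for (;;)
        {
            int     child = 2 * i + 1;

            if (child >= *n)
                break;
            if (child + 1 < *n && heap[child + 1] < heap[child])
                child++;
            if (last <= heap[child])
                break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = last;
        return result;
    }

Pushing a newly released (typically high-numbered) block would then
cost O(log n) instead of setting up an O(n log n) re-sort on the next
allocation.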

Regards,
Jeff Davis
