Re: 9.5: Memory-bounded HashAgg

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-17 17:17:35
Message-ID: 53F0E3AF.9020702@fuzzy.cz
Lists: pgsql-hackers

On 15.8.2014 19:53, Robert Haas wrote:
> On Thu, Aug 14, 2014 at 2:21 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>> On Thu, 2014-08-14 at 12:53 -0400, Tom Lane wrote:
>>> Oh? So if we have aggregates like array_agg whose memory footprint
>>> increases over time, the patch completely fails to avoid bloat?
>>
>> Yes, in its current form.
>>
>>> I might think a patch with such a limitation was useful, if it weren't
>>> for the fact that aggregates of that nature are a large part of the
>>> cases where the planner misestimates the table size in the first place.
>>> Any complication that we add to nodeAgg should be directed towards
>>> dealing with cases that the planner is likely to get wrong.
>>
>> In my experience, the planner has a lot of difficulty estimating the
>> cardinality unless it's coming from a base table without any operators
>> above it (other than maybe a simple predicate). This is probably a lot
>> more common than array_agg problems, simply because array_agg is
>> relatively rare compared with GROUP BY in general.
>
> I think that's right, and I rather like your (Jeff's) approach. It's
> definitely true that we could do better if we have a mechanism for
> serializing and deserializing group states, but (1) I think an awful
> lot of cases would get an awful lot better even just with the approach
> proposed here and (2) I doubt we would make the
> serialization/deserialization interfaces mandatory, so even if we had
> that we'd probably want a fallback strategy anyway.

I certainly agree that we need Jeff's approach even if we can do better
in some cases (when we are able to serialize/deserialize the states).

And yes, (mis)estimating the cardinalities is a big issue, and certainly
a source of many problems.

> Furthermore, I don't really see that we're backing ourselves into a
> corner here. If prohibiting creation of additional groups isn't
> sufficient to control memory usage, but we have
> serialization/deserialization functions, we can just pick an arbitrary
> subset of the groups that we're storing in memory and spool their
> transition states off to disk, thus reducing memory even further. I
> understand Tomas' point to be that this is quite different from what
> we do for hash joins, but I think it's a different problem. In the
> case of a hash join, there are two streams of input tuples, and we've
> got to batch them in compatible ways. If we were to, say, exclude an
> arbitrary subset of tuples from the hash table instead of basing it on
> the hash code, we'd have to test *every* outer tuple against the hash
> table for *every* batch. That would incur a huge amount of additional
> cost vs. being able to discard outer tuples once the batch to which
> they pertain has been processed.

Being able to batch the inner and outer relations in a matching way is
certainly one of the reasons why hashjoin uses that particular scheme.
There are other reasons, though - for example being able to answer 'Does
this group belong to this batch?' quickly, and to update the number of
batches automatically.
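
Just to illustrate what I mean by "cheap" - a simplified sketch, not the
actual nodeHash.c code (the real thing takes bucket and batch from
different bits of the hash, and keeps nbatch a power of two):

#include <stdint.h>
#include <stdbool.h>

/* Which batch does a tuple with this hash value fall into?
 * Assumes nbatch is a power of two, as in hashjoin. */
static inline int
batch_of(uint32_t hashvalue, int nbatch)
{
    return (int) (hashvalue & (uint32_t) (nbatch - 1));
}

/* "Does this group belong to this batch?" is a single mask + compare. */
static inline bool
in_current_batch(uint32_t hashvalue, int nbatch, int curbatch)
{
    return batch_of(hashvalue, nbatch) == curbatch;
}

/* Doubling nbatch keeps old assignments consistent: whatever was in
 * batch b ends up either in b or in b + old_nbatch, so spilled data
 * can be split further without rescanning everything. */

Both of those checks only need a hash value we already have in hand.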

I'm not saying the lookup is extremely costly, but I'd be very surprised
if it were as cheap as a modulo on a 32-bit integer. I'm not saying it's
the dominant cost here either, but memory bandwidth is quickly becoming
one of the main bottlenecks.

> But the situation here isn't comparable, because there's only one
> input stream. I'm pretty sure we'll want to keep track of which
> transition states we've spilled due to lack of memory as opposed to
> those which were never present in the table at all, so that we can
> segregate the unprocessed tuples that pertain to spilled transition
> states from the ones that pertain to a group we haven't begun yet.

Why would that be necessary or useful? I don't see a reason for tracking
that / segregating the tuples.

> And it might be that if we know (or learn as we go along) that we're
> going to vastly blow out work_mem it makes sense to use batching to
> segregate the tuples that we decide not to process onto N tapes binned
> by hash code, so that we have a better chance that future batches will
> be the right size to fit in memory. But I'm not convinced that
> there's a compelling reason why the *first* batch has to be chosen by
> hash code; we're actually best off picking any arbitrary set of groups
> that does the best job reducing the amount of data remaining to be
> processed, at least if the transition states are fixed size and maybe
> even if they aren't.

If you don't choose the first batch by hash code, it's over, IMHO. You
can't easily redo that later, because the HashWork items are already
treated separately.
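
To spell out the difference (toy code only, nothing resembling the actual
patch or the HashWork items): with a hash-based split the assignment is a
pure function of the hash, so it can be recomputed - and refined - at any
time; with an arbitrary subset of groups it can't be recomputed from the
tuple alone:

#include <stdint.h>

/* Hash-based split: a spilled tuple's batch is recomputable from its
 * hash value alone, even after the number of batches changes. */
static inline int
rebatch(uint32_t hashvalue, int new_nbatch)
{
    return (int) (hashvalue & (uint32_t) (new_nbatch - 1));
}

/* Arbitrary split: to place a spilled tuple you have to remember which
 * groups were chosen for which batch, and probe that set per tuple.
 * (Toy structure, invented purely for illustration.) */
typedef struct
{
    int      ngroups;
    uint32_t keys[1024];    /* group keys that were spilled */
    int      batchno[1024]; /* batch each of them went to */
} SpilledGroups;

static int
rebatch_arbitrary(uint32_t groupkey, const SpilledGroups *sg)
{
    for (int i = 0; i < sg->ngroups; i++)
        if (sg->keys[i] == groupkey)
            return sg->batchno[i];
    return -1;              /* not spilled (yet) */
}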

regards
Tomas
