Re: 9.5: Memory-bounded HashAgg

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-08-20 18:32:44
Message-ID: CA+TgmoZ=91QYM_bGTtCSk3AovmLWWqKQkTZtR9HFFpAN3u2aJg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Aug 17, 2014 at 1:17 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> Being able to batch inner and outer relations in a matching way is
> certainly one of the reasons why hashjoin uses that particular scheme.
> There are other reasons, though - for example being able to answer 'Does
> this group belong to this batch?' quickly, and automatically update
> number of batches.
>
> I'm not saying the lookup is extremely costly, but I'd be very surprised
> if it was as cheap as modulo on a 32-bit integer. Not saying it's the
> dominant cost here, but memory bandwidth is quickly becoming one of the
> main bottlenecks.

Well, I think you're certainly right that a hash table lookup is more
expensive than modulo on a 32-bit integer; so much is obvious. But if
the load factor is not too large, I think that it's not a *lot* more
expensive, so it could be worth it if it gives us other advantages.
As I see it, the advantage of Jeff's approach is that it doesn't
really matter whether our estimates are accurate or not. We don't
have to decide at the beginning how many batches to do, and then
possibly end up using too much or too little memory per batch if we're
wrong; we can let the amount of memory actually used during execution
determine the number of batches. That seems good. Of course, a hash
join can increase the number of batches on the fly, but only by
doubling it, so you might go from 4 batches to 8 when 5 would really
have been enough. And a hash join also can't *reduce* the number of
batches on the fly, which might matter a lot. Getting the number of
batches right avoids I/O, which is a lot more expensive than CPU.
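
To make that tradeoff concrete, here's a rough sketch of the two membership
tests being compared. This is illustrative only, not the actual nodeHash.c
code; batch_for_hash and group_lookup are made-up names:

#include <stdint.h>
#include <stddef.h>

/*
 * Hashjoin-style test: which batch does this hash value belong to?
 * nbatch is kept a power of two, so doubling it only ever moves a
 * tuple to a later batch, never an earlier one.  Cost: one mask.
 */
static inline int
batch_for_hash(uint32_t hashvalue, int nbatch)
{
    return (int) (hashvalue & (uint32_t) (nbatch - 1));
}

/*
 * Jeff's scheme, in effect, asks the aggregation hash table itself:
 * is this group's transition state still resident?  Cost: a full probe
 * (bucket walk plus key comparison) per input tuple, but it needs no
 * up-front choice of nbatch at all.  group_lookup is a stand-in for
 * the real hash table probe.
 */
static inline int
group_is_resident(const void *hashtable, const void *groupkey,
                  const void *(*group_lookup)(const void *, const void *))
{
    return group_lookup(hashtable, groupkey) != NULL;
}

The masked version is a couple of instructions; the probe touches the
table's memory on every input tuple. Whether that difference actually
shows up in profiles is the open question.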

>> But the situation here isn't comparable, because there's only one
>> input stream. I'm pretty sure we'll want to keep track of which
>> transition states we've spilled due to lack of memory as opposed to
>> those which were never present in the table at all, so that we can
>> segregate the unprocessed tuples that pertain to spilled transition
>> states from the ones that pertain to a group we haven't begun yet.
>
> Why would that be necessary or useful? I don't see a reason for tracking
> that / segregating the tuples.

Suppose there are going to be three groups: A, B, C. Each is an
array_agg(), and they're big, so only one of them will fit in work_mem at
a time. However, we don't know that at the beginning, either because
we don't write the code to try or because we do write that code but
our cardinality estimates are way off; instead, we're under the
impression that all three will fit in work_mem. So we start reading
tuples. We see values for A and B, but we don't see any values for C
because those all occur later in the input. Eventually, we run short
of memory and cut off creation of new groups. Any tuples for C are
now going to get written to a tape from which we'll later reread them.
After a while, even that proves insufficient and we spill the
transition state for B to disk. Any further tuples that show up for C
will need to be written to tape as well. We continue processing and
finish group A.
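
In code terms, the per-tuple routing I have in mind looks roughly like the
sketch below. The names are all hypothetical, just to pin down the idea;
lookup_in_memory and lookup_spilled_tape stand in for the real hash-table
probes:

#include <stddef.h>

typedef struct Group Group;

typedef struct SpillRouter
{
    Group *(*lookup_in_memory)(const void *key);    /* NULL if not resident */
    int    (*lookup_spilled_tape)(const void *key); /* -1 if never admitted */
    int     new_group_tape;                         /* tape for groups like C */
} SpillRouter;

/* Returns the tape to write the tuple to, or -1 to aggregate it in place. */
static int
route_tuple(const SpillRouter *r, const void *key)
{
    int     tapenum;

    if (r->lookup_in_memory(key) != NULL)
        return -1;                          /* group A, or B before eviction */

    tapenum = r->lookup_spilled_tape(key);
    if (tapenum >= 0)
        return tapenum;                     /* group B after its state spilled */

    return r->new_group_tape;               /* group C, never admitted at all */
}

The point is just that the second and third cases end up on different
tapes.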

Now it's time to do batch #2. Presumably, we begin by reloading the
serialized transition state for group B. To finish group B, we must
look at all the tuples that might possibly fall in that group. If all
of the remaining tuples are on a single tape, we'll have to read all
the tuples in group B *and* all the tuples in group C; we'll
presumably rewrite the tuples that are not part of this batch onto a
new tape, which we'll then process in batch #3. But if we took
advantage of the first pass through the input to put the tuples for
group B on one tape and the tuples for group C on another tape, we can
be much more efficient - just read the remaining tuples for group B,
not mixed with anything else, and then read a separate tape for group
C.
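
Again as a sketch with made-up names (declarations only, to show the
shape), batch #2 then collapses to reloading one transition state and
replaying one tape:

#include <stddef.h>

typedef struct AggSpillContext AggSpillContext;
typedef struct TransitionState TransitionState;
typedef struct SpillTape SpillTape;
typedef struct SpilledTuple SpilledTuple;

extern TransitionState *reload_transition_state(AggSpillContext *ctx, int groupno);
extern SpillTape *group_tape(AggSpillContext *ctx, int groupno);
extern SpilledTuple *tape_read_next(SpillTape *tape);
extern void advance_transition(AggSpillContext *ctx, TransitionState *state,
                               SpilledTuple *tup);
extern void emit_group_result(AggSpillContext *ctx, TransitionState *state);

static void
process_one_batch(AggSpillContext *ctx, int groupno)
{
    TransitionState *state = reload_transition_state(ctx, groupno); /* e.g. B */
    SpillTape    *tape = group_tape(ctx, groupno);
    SpilledTuple *tup;

    while ((tup = tape_read_next(tape)) != NULL)
        advance_transition(ctx, state, tup);    /* only this group's tuples */

    emit_group_result(ctx, state);              /* finish group B */
}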

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
