Re: Memory-Bounded Hash Aggregation

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Taylor Vesely <tvesely(at)pivotal(dot)io>, Adam Lee <ali(at)pivotal(dot)io>, Melanie Plageman <mplageman(at)pivotal(dot)io>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory-Bounded Hash Aggregation
Date: 2019-12-05 07:28:04
Message-ID: 976d9573c6e0534eb90d583c49137c6fe750cb4f.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2019-12-04 at 17:24 -0800, Melanie Plageman wrote:
>
> It looks like the parameter input_tuples passed to hash_spill_init()
> in lookup_hash_entries() is the number of groups estimated by
> planner.
> However, when reloading a spill file, if we run out of memory and
> re-spill, hash_spill_init() is passed batch->input_groups (which is
> actually set from input_ngroups which is the number of tuples in the
> spill file). So, input_tuples is groups and input_groups is
> input_tuples. It may be helpful to rename this.

You're right; this is confusing. I will clarify this in the next patch.

> So, it looks like the HASH_PARTITION_FACTOR is only used when
> re-spilling. The initial hashtable will use work_mem.
> It seems like the reason for using it when re-spilling is to be very
> conservative to avoid more than one re-spill and make sure each spill
> file fits in a hashtable in memory.

It's used any time a spill happens, even the first spill. I'm flexible
on the use of HASH_PARTITION_FACTOR though... it seems not everyone
thinks it's a good idea. To me it's just a knob to tune and I tend to
think over-partitioning is the safer bet most of the time.
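
To make the knob concrete, the partition choice works out to something
roughly like the sketch below. This is only an illustration: the
function name, the constants, and the sizing inputs are placeholders,
not the patch code.

#include <math.h>

/* placeholder constants, for illustration only */
#define HASH_PARTITION_FACTOR   1.5
#define HASH_MIN_PARTITIONS     4
#define HASH_MAX_PARTITIONS     256

/*
 * Hypothetical helper: pick enough partitions that each partition's
 * groups should fit in work_mem, padded by the over-partitioning
 * factor so that another re-spill is unlikely.
 */
static int
choose_spill_partitions(double input_groups, double groupsize,
                        double work_mem_bytes)
{
    double  mem_wanted = input_groups * groupsize * HASH_PARTITION_FACTOR;
    int     npartitions = (int) ceil(mem_wanted / work_mem_bytes);

    if (npartitions < HASH_MIN_PARTITIONS)
        npartitions = HASH_MIN_PARTITIONS;
    if (npartitions > HASH_MAX_PARTITIONS)
        npartitions = HASH_MAX_PARTITIONS;

    return npartitions;
}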

> The comment does seem to point to some other reason, though...

I have observed some anomalies where smaller work_mem values (for
already-low values of work_mem) result in faster runtime. The only
explanation I have is caching effects.

> 256 actually seems very large. hash_spill_npartitions() will be
> called
> for every respill, so HASH_MAX_PARTITIONS is not the total number of
> spill files permitted, but, actually, it is the number of respill
> files in a given spill (a spill set). So if you made X partitions
> initially and every partition re-spills, now you would have (at most)
> X * 256 partitions.

Right. Though I'm not sure there's any theoretical max... given enough
input tuples, it will just keep getting deeper. If this is a serious
concern, maybe I should make it depth-first recursion by prepending new
work items rather than appending. That would still not bound the
theoretical max, but it would slow the growth.
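
Assuming the spill batches are consumed from the front of a work list,
the difference is just which end new batches are pushed onto. The names
below are hypothetical; lappend()/lcons() are the stock list
primitives:

#include "nodes/pg_list.h"

static List *
add_spill_batch(List *batches, void *new_batch, bool depth_first)
{
    if (depth_first)
        /* a respilled batch's children are drained before older batches */
        return lcons(new_batch, batches);
    else
        /* current approach: append, so batches run in creation order */
        return lappend(batches, new_batch);
}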

> If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
> files take up a lot of memory at that point?

Yes. Each file keeps a BLCKSZ buffer, plus some other metadata. And it
does create a file, so it's offloading some work to the OS to manage
that new file.
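To put a number on it: at the default BLCKSZ of 8192, 256 spill files
mean 256 * 8kB = 2MB in buffers alone, which is already half of the
default 4MB work_mem.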

It's annoying to properly account for these costs because the memory
needs to be reserved at the time we are building the hash table, but we
don't know how many partitions we want until it comes time to spill.
And for that matter, we don't even know whether we will need to spill
or not.

There are two alternative approaches which sidestep this problem:

1. Reserve a fixed fraction of work_mem, say 1/8, to make space for
however many partitions that memory can handle (a rough sketch of this
follows after the list). We would still have a min and max, but the
logic for reserving the space would be easy, and so would choosing the
number of partitions to create.
* Pro: simple
* Con: lose the ability to choose the number of partitions

2. Use logtape.c instead (suggestion from Heikki). Supporting more
logical tapes doesn't impose costs on the OS, and we can potentially
use a lot of logical tapes.
* Pro: can use lots of partitions without making lots of files
* Con: buffering still needs to happen somewhere, so we still need
memory for each logical tape. Also, we risk losing locality of read
access when reading the tapes, or perhaps confusing readahead.
Fundamentally, logtape.c was designed for sequential write, random
read; but we are going to do random write and sequential read.
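
As a rough sketch of what option 1 could look like (illustrative only;
the 1/8 fraction, the floor of 2, and the function name are all
placeholders, not a concrete proposal):

#ifndef BLCKSZ
#define BLCKSZ 8192                     /* default PostgreSQL block size */
#endif

/*
 * Hypothetical: reserve a fixed fraction of work_mem for spill-file
 * buffers and let that reservation dictate the partition count.
 */
static int
partitions_for_reserved_mem(long work_mem_bytes)
{
    long    reserved = work_mem_bytes / 8;          /* fixed 1/8 of work_mem */
    int     npartitions = (int) (reserved / BLCKSZ); /* one buffer per file */

    if (npartitions < 2)
        npartitions = 2;
    return npartitions;
}

The simplicity is the point: the reservation is known up front, at hash
table build time, before we know whether we will spill at all.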

Regards,
Jeff Davis
