Re: Wrong hash table size calculation in Parallel Hash Join

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Wrong hash table size calculation in Parallel Hash Join
Date: 2020-01-21 03:18:59
Message-ID: CA+hUKGKrVDRRkz9N96BvWE+mJhS-+vBTbQBvxVTLuMVH2YUBMw@mail.gmail.com
Lists: pgsql-bugs

On Sat, Jan 18, 2020 at 9:41 PM Andrey Lepikhov
<a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
> While running the Join Order Benchmark [1] test on my laptop with 8GB
> RAM, I saw a strange situation where Linux (Ubuntu 18, PostgreSQL master
> branch) killed a backend (or a parallel worker in some cases) with signal 9.
> Dmesg showed an error:
> [176313.266426] Out of memory: Kill process 16011 (postgres) score 135
> or sacrifice child
> [176313.266432] Killed process 16011 (postgres) total-vm:1602320kB,
> anon-rss:1325908kB, file-rss:1136kB, shmem-rss:44612kB
>
> I spent some time investigating and found the problem.
> The hash table size calculation is based on the GUC "work_mem". But if
> we have a huge relation, we divide it into batches, stored in shared
> tuplestores. At the first write into a batch, the tuplestore allocates
> a memory buffer:
>
> accessor->write_chunk = (SharedTuplestoreChunk *)
> MemoryContextAllocZero(accessor->context, STS_CHUNK_PAGES * BLCKSZ);
>
> If we have many batches, we have many additional memory allocations. In
> my case work_mem=64MB, nbatches=65000, and we need about 2GB of
> additional memory.

Hi Andrey,

You're right. There are several things that we don't count when doing
memory budgeting computations, and some of them grow in proportion to
the number of batches. Here are three of them:

1. There's an array of ParallelHashJoinBatch objects in shared memory.
2. There's a buffer inside every SharedTuplestoreAccessor (write_chunk).
3. There's a buffer inside every BufFile.

Point 3 applies to non-parallel hash join too. The buffers are bigger
problems, just because they're bigger, especially the one you
highlighted. The double-buffering there could be fixed
(SharedTuplestoreAccessor doesn't care about BufFile's buffering; it
mostly wants its segmentation logic).
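
To put rough numbers on it (back-of-the-envelope, assuming the default
8kB BLCKSZ and STS_CHUNK_PAGES = 4), the write_chunk buffers alone come
to about

    65000 batches * STS_CHUNK_PAGES (4) * BLCKSZ (8192 bytes) = ~2.1GB

which lines up with the ~2GB of extra memory you saw, before even
counting the per-batch BufFile buffers or the shared batch array.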

I don't think that's the only problem with having large numbers of
batches though. It generates about the worst access pattern
imaginable for the OS and storage. If you set work_mem small enough,
you can finish up writing randomly to many millions of files, having to
reopen them constantly because of max_files_per_process, while the
kernel keeps booting the inodes out of memory and flushing small,
unmergeable I/Os all over the device. At some point, it must be
better to do multiple passes (that's what some other systems do when
partitioning hash joins in memory just to avoid (memory) cache misses,
but partitioning to disk files is another thing). Perhaps you'd
partition the data into (say) 1024 batches + one 'overflow' batch for
all the rest, and then partition the rest into the next 1024 batches +
the rest and so on (Tomas Vondra proposed something like this).
Another idea is that you'd limit the number of batches to a sane
number and switch to a kind of nested loop hash join, where you load
fragments of the hash table at a time. The advantage of the looping
approach is that it can handle some cases that the overflow idea
can't, but on the other hand its complexity is worse: it is, after all,
a kind of nested loop.
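
To make the looping idea a bit more concrete, here is a toy,
self-contained sketch -- not PostgreSQL code, and the values, names and
the plain array scan standing in for a real hash table are purely
illustrative.  The point is only the shape of the loop: load a
"work_mem"-sized fragment of the inner batch, rescan the whole outer
batch against it, and repeat until the inner batch is exhausted.

#include <stdio.h>

#define INNER_SIZE 10
#define OUTER_SIZE 6
#define FRAGMENT 4              /* pretend only 4 inner tuples fit in memory */

int
main(void)
{
    int inner[INNER_SIZE] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int outer[OUTER_SIZE] = {2, 4, 4, 7, 11, 9};

    /* Loop over "work_mem"-sized fragments of the inner batch. */
    for (int start = 0; start < INNER_SIZE; start += FRAGMENT)
    {
        int end = (start + FRAGMENT < INNER_SIZE) ? start + FRAGMENT : INNER_SIZE;

        /* "Build": inner[start..end) stands in for the hash table. */
        /* "Probe": rescan the entire outer batch against this fragment. */
        for (int o = 0; o < OUTER_SIZE; o++)
            for (int i = start; i < end; i++)
                if (outer[o] == inner[i])
                    printf("match: %d\n", outer[o]);
    }
    return 0;
}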

There are some links to threads about these topics here (and I've
added your report as a link):

https://wiki.postgresql.org/wiki/Hash_Join

> I haven't specialized in the Parallel Hash Join code before, but
> looking at the code, a more accurate calculation of the hash table size
> would be based on a quadratic equation, something like this:
>
> M_h = (M + sqrt(M^2 - M_b*RelSize)) / 2
>
> where M is the work_mem GUC; M_h is the estimated size of the hash
> table; RelSize is the size of the relation in bytes; and
> M_b = 4 * STS_CHUNK_PAGES * BLCKSZ.
>
> In the case of M^2 < M_b*RelSize we need to increase work_mem locally.
>
> As a sketch of the solution I prepared a patch (see attachment).
> If this is a significant problem, I'm ready to continue working on it.

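(For what it's worth, here is how I read that derivation, just to check
we mean the same thing: each batch needs a write buffer of roughly
STS_CHUNK_PAGES * BLCKSZ bytes, and the number of batches is about
RelSize / M_h, so the budget constraint is
M_h + (STS_CHUNK_PAGES * BLCKSZ) * RelSize / M_h <= M.  Rearranged, that
is the quadratic M_h^2 - M*M_h + (STS_CHUNK_PAGES * BLCKSZ) * RelSize <= 0,
whose largest solution is the M_h above with
M_b = 4 * STS_CHUNK_PAGES * BLCKSZ; the discriminant is only
non-negative when M^2 >= M_b * RelSize, hence the need to raise work_mem
locally in that case.)
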
We can discuss how to do the accounting maths, but the real question
is: what do you actually want to do when you run out of memory? At
some point you can't create any more batches because that takes more
memory itself, so you have to switch to some new strategy, or give up
on work_mem. If you have to give up on work_mem as soon as you hit
this condition anyway, then you'll still be able to get an OOM
SIGKILL. With the "overflow" idea, you can at least add new batches
"free" (in terms of memory for bookkeeping), though you can't deal
with indivisible batches (extreme skew). With the "loop" idea, you
can stop adding new batches and you have a new way to divide
indivisible batches.
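
If I understand the overflow scheme correctly, the reason adding batches
becomes "free" is that the number of batch files open for writing at any
one time is capped: with, say, 1024 batches plus one overflow file, the
per-backend buffer overhead stays around 1025 * STS_CHUNK_PAGES * BLCKSZ
(roughly 32MB with the defaults) no matter how many logical batches you
eventually end up with, because everything beyond the current window just
goes to the overflow file and gets re-partitioned on a later pass.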
