From: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
To: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Wrong hash table size calculation in Parallel Hash Join
While running the Join Order Benchmark test on my laptop with 8GB of
RAM, I saw a strange situation where Linux (Ubuntu 18, PostgreSQL master
branch) killed a backend (or, in some cases, a parallel worker) with
signal 9. Dmesg showed the error:
[176313.266426] Out of memory: Kill process 16011 (postgres) score 135
or sacrifice child
[176313.266432] Killed process 16011 (postgres) total-vm:1602320kB,
anon-rss:1325908kB, file-rss:1136kB, shmem-rss:44612kB
I spent some time investigating and found the problem.
The hash table size calculation is based on the work_mem GUC. But if we
have a huge relation, we divide it into batches, stored in shared
tuplestores. On the first write into a batch, the tuplestore allocates
accessor->write_chunk = (SharedTuplestoreChunk *)
MemoryContextAllocZero(accessor->context, STS_CHUNK_PAGES * BLCKSZ);
If we have many batches, we have many of these additional memory
allocations. In my case work_mem=64MB, nbatches=65000, and we need about
2GB of additional memory.
I have not specialized in the Parallel Hash Join code before, but from
reading the code, a more accurate calculation of the hash table size
would be based on a quadratic equation, something like this:
M_h = (M + sqrt(M^2 - M_b*RelSize)) / 2
where M is the work_mem GUC; M_h is the estimated hash table size;
RelSize is the size of the relation in bytes; and M_b = 4 *
STS_CHUNK_PAGES * BLCKSZ. In the case of M^2 < M_b * RelSize the
discriminant is negative and we need to increase work_mem.
As a sketch of a solution, I prepared a patch (see attachment).
If this is a significant problem, I'm ready to continue working on the
solution.
The Russian Postgres Company