Re: logical decoding : exceeded maxAllocatedDescs for .spill files

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Alvaro Herrera from 2ndQuadrant <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Date: 2020-01-09 23:51:41
Message-ID: 14739.1578613901@sss.pgh.pa.us
Lists: pgsql-hackers

Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> On Thu, Jan 9, 2020 at 11:15 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Noah Misch <noah(at)leadboat(dot)com> writes:
>>> Even so, a web search for "extend_brk" led to the answer. By default, 32-bit
>>> AIX binaries get only 256M of RAM for stack and sbrk. The new regression test
>>> used more than that, hence this crash.

>> Hm, so
>> (1) Why did we get a crash and not some more-decipherable out-of-resources
>> error? Can we improve that experience?
>> (2) Should we be dialing back the resource consumption of this test?

> In HEAD, we have a guc variable 'logical_decoding_work_mem' by which
> we can control the memory usage of changes and we have used that, but
> for back branches, we don't have such a control.

I poked into this a bit more by running the src/test/recovery tests under
restrictive ulimit settings. I used

ulimit -s 1024
ulimit -v 250000

(At least on my 64-bit RHEL6 box, reducing ulimit -v much below this
causes initdb to fail, apparently because the post-bootstrap process
tries to load all our tsearch and encoding conversion shlibs at once,
and it hasn't got enough VM space to do so. Someday we may have to
improve that.)
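
For anyone who wants to repeat this kind of run from a script rather than an interactive shell, here is a minimal sketch using Python's standard resource module. It assumes the two ulimit values above are in 1024-byte units (as bash reports them) and lowers only the soft limits, whereas a bare bash ulimit also sets the hard limit. The helper names (limits_in_bytes, apply_limits, run_limited) are hypothetical, and the make invocation in the usage note is an assumed way to run the recovery tests, not something taken from the message.

```python
import resource
import subprocess

KIB = 1024  # assumption: ulimit -s / -v values are in 1024-byte units, as in bash


def limits_in_bytes(stack_kib=1024, vmem_kib=250000):
    """Translate the two ulimit values quoted above into byte counts."""
    return {
        resource.RLIMIT_STACK: stack_kib * KIB,  # ulimit -s 1024
        resource.RLIMIT_AS: vmem_kib * KIB,      # ulimit -v 250000
    }


def apply_limits(limits):
    """Lower the soft limits of the calling process (intended for a child)."""
    for which, wanted in limits.items():
        _soft, hard = resource.getrlimit(which)
        # Never ask for more than the existing hard limit allows.
        if hard != resource.RLIM_INFINITY:
            wanted = min(wanted, hard)
        resource.setrlimit(which, (wanted, hard))


def run_limited(cmd, limits=None):
    """Run cmd so that only the child process sees the reduced limits."""
    limits = limits or limits_in_bytes()
    return subprocess.run(cmd, preexec_fn=lambda: apply_limits(limits))
```

Possible usage, from the top of a PostgreSQL source tree configured with TAP tests enabled (an assumption): run_limited(["make", "-C", "src/test/recovery", "check"]). Applying the limits in the child keeps the calling process unrestricted.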

I did not manage to duplicate Noah's crash this way. What I see in
the v10 branch is that the new 006_logical_decoding.pl test fails,
but with a clean "out of memory" error. The memory map dump that
this produces fingers the culprit pretty unambiguously:

...
ReorderBuffer: 223302560 total in 26995 blocks; 7056 free (3 chunks); 223295504 used
ReorderBufferByXid: 24576 total in 2 blocks; 11888 free (3 chunks); 12688 used
Slab: TXN: 8192 total in 1 blocks; 5208 free (21 chunks); 2984 used
Slab: Change: 2170880 total in 265 blocks; 2800 free (35 chunks); 2168080 used
...
Grand total: 226714720 bytes in 27327 blocks; 590888 free (785 chunks); 226123832 used

The test case is only inserting 50K fairly-short rows, so this seems
like an unreasonable amount of memory to be consuming for that; and
even if you think it's reasonable, it clearly isn't going to scale
to large production transactions.
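
To put a number on "unreasonable", the figures in the dump can be divided by the row count. The sketch below is back-of-the-envelope arithmetic only: it takes "50K" to mean exactly 50,000 rows (an assumption) and uses the byte and block totals copied from the dump above. It works out to about 4,466 bytes of ReorderBuffer memory per inserted row.

```python
# Figures copied from the memory dump above.
REORDER_BUFFER_TOTAL = 223_302_560   # bytes in the ReorderBuffer context
REORDER_BUFFER_BLOCKS = 26_995       # blocks in the ReorderBuffer context
GRAND_TOTAL = 226_714_720            # bytes across all contexts

ROWS_INSERTED = 50_000  # assumption: "50K" rows taken as exactly 50,000


def per_row(total_bytes, rows=ROWS_INSERTED):
    """Average number of bytes attributable to each inserted row."""
    return total_bytes / rows


# Fraction of all allocated memory that sits in the ReorderBuffer context.
reorder_buffer_share = REORDER_BUFFER_TOTAL / GRAND_TOTAL

# Average size of one ReorderBuffer block.
avg_block_bytes = REORDER_BUFFER_TOTAL / REORDER_BUFFER_BLOCKS

print(f"ReorderBuffer bytes per row: {per_row(REORDER_BUFFER_TOTAL):.0f}")
print(f"All contexts, bytes per row: {per_row(GRAND_TOTAL):.0f}")
print(f"ReorderBuffer share of total: {reorder_buffer_share:.1%}")
print(f"Average ReorderBuffer block:  {avg_block_bytes:.0f} bytes")
```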

Now, the good news is that v11 and later get through
006_logical_decoding.pl just fine under the same restriction.
So we did something in v11 to fix this excessive memory consumption.
However, unless we're willing to back-port whatever that was, this
test case is clearly consuming excessive resources for the v10 branch.

We're not out of the woods either. I also observe that v12 and HEAD
fall over, under these same test conditions, with a stack-overflow
error in the 012_subtransactions.pl test. This seems to be due to
somebody's decision to use a heavily recursive function to generate a
bunch of subtransactions. Is there a good reason for hs_subxids() to
use recursion instead of a loop? If there is, what's the value of
using 201 levels rather than, say, 10?
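
The stack cost behind that question can be illustrated with a generic sketch. This is a Python stand-in, not the actual hs_subxids() function from the test, and the names nested and looped are hypothetical. It only shows that a recursive generator holds one call frame per level while a loop stays at constant depth; whether a loop would produce the same subtransaction structure in the real test is exactly what the question above leaves open.

```python
def nested(n, depth=1, stats=None):
    """Create n units of work recursively, one call frame per unit."""
    if stats is None:
        stats = {"units": 0, "max_depth": 0}
    if n <= 0:
        return stats
    stats["units"] += 1
    stats["max_depth"] = max(stats["max_depth"], depth)
    # Each level stays on the stack until every deeper level has returned.
    return nested(n - 1, depth + 1, stats)


def looped(n):
    """Create n units of work in a loop, at constant call depth."""
    stats = {"units": 0, "max_depth": 0}
    for _ in range(n):
        stats["units"] += 1
        stats["max_depth"] = 1  # the loop body never nests any deeper
    return stats


for levels in (201, 10):
    print(levels, nested(levels), looped(levels))
```

Both variants do the same number of units of work; only the peak depth differs, and peak depth is what a small stack limit constrains.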

Anyway, it remains unclear why Noah's machine got a crash instead of
something more user-friendly. But the reason why it's only in the
v10 branch seems non-mysterious.

regards, tom lane
