From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Alvaro Herrera from 2ndQuadrant <alvherre(at)alvh(dot)no-ip(dot)org>, Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: logical decoding : exceeded maxAllocatedDescs for .spill files
Date: 2020-02-07 19:34:34
Message-ID: 20200207193434.wkwne7d4wr5kebye@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-01-11 23:20:56 -0500, Tom Lane wrote:
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
> Nah, don't think I believe that: the test inserts a bunch of tuples,
> but they look like they will all be *exactly* the same size.
>
> CREATE TABLE decoding_test(x integer, y text);
> ...
>
> FOR i IN 1..10 LOOP
> BEGIN
> INSERT INTO decoding_test(x) SELECT generate_series(1,5000);
> EXCEPTION
> when division_by_zero then perform 'dummy';
> END;

I think the reason this triggers higher memory usage in older versions
is that, before

commit cec2edfa7859279f36d2374770ca920c59c73dd8
Author: Amit Kapila <akapila(at)postgresql(dot)org>
Date: 2019-11-16 17:49:33 +0530

Add logical_decoding_work_mem to limit ReorderBuffer memory usage.

we enforced a limit on how many changes to keep in memory (vs. on disk)

/*
* Maximum number of changes kept in memory, per transaction. After that,
* changes are spooled to disk.
*
* The current value should be sufficient to decode the entire transaction
* without hitting disk in OLTP workloads, while starting to spool to disk in
* other workloads reasonably fast.
*
* At some point in the future it probably makes sense to have a more elaborate
* resource management here, but it's not entirely clear what that would look
* like.
*/
static const Size max_changes_in_memory = 4096;

on a per-transaction basis. Note that subtransactions count as *different*
transactions for that purpose (as they can be rolled back
separately). Since the test generates loads of records across different
subtransactions, each of them ends up holding quite a few changes (including
the tuples they point to!) in memory at the same time.
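
Roughly paraphrased from memory (the exact shape in the back branches may
differ a little), the pre-cec2edfa spill check looks like:

/*
 * Spill a transaction's changes to disk once it alone has accumulated
 * too many changes in memory. Subtransactions are passed in here
 * individually, so each of them is checked and spilled on its own.
 */
static void
ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	if (txn->nentries_mem >= max_changes_in_memory)
	{
		ReorderBufferSerializeTXN(rb, txn);
		Assert(txn->nentries_mem == 0);
	}
}

i.e. the threshold applies to each (sub)transaction's own in-memory change
count, so ten concurrent subtransactions can together keep close to 10 * 4096
changes (and their tuples) in memory without anything being spilled.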

Due to the way the limit of 4096 interacts with the 5000 rows inserted
above, we only hit the out-of-memory error when loading the changes back,
not while decoding. That's because during decoding (before the commit has
been seen), we spill after 4096 changes:

2020-02-07 11:18:22.399 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 585 to disk
2020-02-07 11:18:22.419 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 586 to disk
2020-02-07 11:18:22.431 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 587 to disk
2020-02-07 11:18:22.443 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 588 to disk
2020-02-07 11:18:22.454 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 589 to disk
2020-02-07 11:18:22.465 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 590 to disk
2020-02-07 11:18:22.477 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 591 to disk
2020-02-07 11:18:22.488 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 592 to disk
2020-02-07 11:18:22.499 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 593 to disk
2020-02-07 11:18:22.511 PST [1136134][3/2] DEBUG: spill 4096 changes in XID 594 to disk

so each subtransaction keeps 5000 - 4096 = 904 changes in memory, times 10.
But when actually calling the output plugin (at the commit record), we start
loading changes back into memory from the start of each subtransaction. That
first entails spilling the tail of that subtransaction to disk, and then
loading its start:

2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG: StartSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG: StartSubTransaction(2) name: replay; blockState: SUB BEGIN; state: INPROGR, xid/subid/cid: 0/2/0
2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG: spill 904 changes in XID 585 to disk
2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG: restored 4096 changes in XID 585 into memory
2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG: spill 904 changes in XID 586 to disk
2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG: restored 4096 changes in XID 586 into memory
2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG: spill 904 changes in XID 587 to disk
2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG: restored 4096 changes in XID 587 into memory
2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG: spill 904 changes in XID 588 to disk
2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG: restored 4096 changes in XID 588 into memory
2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG: spill 904 changes in XID 589 to disk
TopMemoryContext: 161440 total in 7 blocks; 80240 free (68 chunks); 81200 used
...

Because each transaction has 4096 changes in memory, we actually need
more memory here than we did during the decoding phase, where all but
the "current" subtransaction only have 5000 - 4096 changes in memory.

If we instead change the test to insert 4096*2 - 1 tuples each, we run
out of memory earlier:
2020-02-07 11:23:20.540 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 610 to disk
2020-02-07 11:23:20.565 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 611 to disk
2020-02-07 11:23:20.587 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 612 to disk
2020-02-07 11:23:20.608 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 613 to disk
2020-02-07 11:23:20.630 PST [1136134][3/12] DEBUG: spill 4096 changes in XID 614 to disk
TopMemoryContext: 161440 total in 7 blocks; 79264 free (82 chunks); 82176 used
...
2020-02-07 11:23:20.655 PST [1136134][3/12] ERROR: out of memory
2020-02-07 11:23:20.655 PST [1136134][3/12] DETAIL: Failed on request of size 8208.
2020-02-07 11:23:20.655 PST [1136134][3/12] STATEMENT: SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL);
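
Same back-of-envelope as above: with 4096*2 - 1 = 8191 rows, each
subtransaction spills once at 4096 changes and then keeps the remaining 4095
in memory, i.e. roughly 4095 * ~8kB ≈ 32MB per subtransaction instead of
~7MB, so whatever memory limit this environment has is already reached
during decoding, before replay even starts.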

The reason that the per-subxact enforcement of max_changes_in_memory
isn't as noticeable in 11 is that there we have the generational memory
context, which means that each of the 4096*10 tuples we have in memory
doesn't take up MaxHeapTupleSize, but instead something like ~30 bytes.
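
For comparison, the >= 11 allocation path (rough paraphrase, not an exact
copy) hands out exactly what the tuple needs from a generation context,
which packs allocations densely and frees memory in whole blocks:

	/* in ReorderBufferAllocate(): a generation context just for tuple data */
	rb->tup_context = GenerationContextCreate(new_ctx,
											  "Tuples",
											  SLAB_LARGE_BLOCK_SIZE);

	/* in ReorderBufferGetTupleBuf(): allocate only what this tuple needs */
	alloc_len = tuple_len + SizeofHeapTupleHeader;
	tuple = (ReorderBufferTupleBuf *)
		MemoryContextAlloc(rb->tup_context,
						   sizeof(ReorderBufferTupleBuf) +
						   MAXIMUM_ALIGNOF + alloc_len);

so a tiny tuple only costs its actual size plus a small header, rather than
a full MaxHeapTupleSize buffer.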

I wonder if we, in the back branches that don't have the generation
context, should just reduce the size of slab-allocated tuples to ~1024
bytes instead of MaxHeapTupleSize. That's an almost trivial change, as we
already have to support tuples above that limit (in cases where the
oldtuple in an update/delete contains toasted columns that we "inlined").
POC for that attached.
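
To illustrate the shape of that (not the attached patch -- the constant name
and exact value here are made up), something like:

/* hypothetical cap for slab-cached tuple buffers, instead of MaxHeapTupleSize */
#define CACHED_TUPLEBUF_SIZE	1024

	alloc_len = tuple_len + SizeofHeapTupleHeader;

	/* small tuples still get a fixed-size buffer that can be recycled */
	if (alloc_len < CACHED_TUPLEBUF_SIZE)
		alloc_len = CACHED_TUPLEBUF_SIZE;

	/*
	 * Anything larger falls through to the already-existing oversized path:
	 * an exact-size allocation that is freed rather than cached, the same
	 * path used today when an update/delete's oldtuple contains "inlined"
	 * toasted columns and exceeds MaxHeapTupleSize.
	 */

which would cut the worst case from ~8kB to ~1kB per in-memory change for
small tuples.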

Greetings,

Andres Freund

Attachment Content-Type Size
wip-limit-slab-allocated-tuple-size.diff text/x-diff 2.3 KB
