Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-01-11 19:41:31
Message-ID: ec3fbad0-06f6-08cb-7f0e-edd3fb0c2785@2ndquadrant.com
Lists: pgsql-hackers

On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
>
> Currently, limiting the amount of memory consumed by logical decoding is
> tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here. This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together. One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction. The top-level data structure for that is the infamous
ReorderBuffer. So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange). When the transaction commits, the
accumulated changes are passed to the output plugin and then freed. If
the transaction aborts, then changes are just thrown away.
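
To make that flow concrete, here is a toy model of the life cycle just
described.  This is only a sketch of the idea, not the actual
reorderbuffer.c code; all names and structures below are made up.

/* toy_reorderbuffer.c: a drastically simplified model of the flow above */
#include <stdio.h>
#include <stdlib.h>

typedef struct Change
{
    struct Change *next;
    char    data[64];           /* decoded row, update key, etc. */
} Change;

typedef struct Txn
{
    struct Txn *next;
    unsigned    xid;
    Change     *head;           /* changes in the order seen in the WAL */
    Change    **tail;
} Txn;

static Txn *txns;               /* all in-progress xacts (one walsender) */

static Txn *
get_txn(unsigned xid)
{
    Txn *t;

    for (t = txns; t; t = t->next)
        if (t->xid == xid)
            return t;
    t = calloc(1, sizeof(*t));
    t->xid = xid;
    t->tail = &t->head;
    t->next = txns;
    txns = t;
    return t;
}

/* seen while reading WAL: keep a copy of the change, indexed by xid */
static void
queue_change(unsigned xid, const char *data)
{
    Txn    *t = get_txn(xid);
    Change *c = calloc(1, sizeof(*c));

    snprintf(c->data, sizeof(c->data), "%s", data);
    *t->tail = c;
    t->tail = &c->next;
}

/* commit: hand the accumulated changes to the output plugin, then free;
 * an abort would simply free them without calling the plugin */
static void
commit_txn(unsigned xid)
{
    Txn   **p, *t;
    Change *c, *next;

    for (p = &txns; (t = *p) != NULL; p = &t->next)
        if (t->xid == xid)
            break;
    if (t == NULL)
        return;
    for (c = t->head; c; c = next)
    {
        printf("output plugin gets: %s\n", c->data);
        next = c->next;
        free(c);
    }
    *p = t->next;
    free(t);
}

int
main(void)
{
    queue_change(10, "INSERT ...");     /* two xacts, interleaved in WAL */
    queue_change(11, "UPDATE ...");
    queue_change(10, "DELETE ...");
    commit_txn(10);                     /* only now does xid 10 get output */
    commit_txn(11);
    return 0;
}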

So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction. When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.
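
Conceptually, that reassembly is a merge by LSN over the top-level
transaction and all of its subtransactions (the real code, as far as I
can tell, does this with a heap over all the per-(sub)transaction
streams).  A toy two-way merge, just to show the idea:

#include <stdio.h>

/* toy change carrying only its WAL position (LSN) */
typedef struct Chg
{
    unsigned long lsn;
    struct Chg *next;
} Chg;

/*
 * Merge two LSN-ordered change lists into a single LSN-ordered list,
 * the way a top-level transaction's changes and a subtransaction's
 * changes have to be interleaved before the output plugin sees them.
 */
static Chg *
merge_by_lsn(Chg *a, Chg *b)
{
    Chg  head = {0, NULL};
    Chg *tail = &head;

    while (a && b)
    {
        if (a->lsn <= b->lsn) { tail->next = a; a = a->next; }
        else                  { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

int
main(void)
{
    Chg top2 = {400, NULL}, top1 = {100, &top2};   /* top-level xact */
    Chg sub2 = {300, NULL}, sub1 = {200, &sub2};   /* one subxact */
    Chg *c;

    for (c = merge_by_lsn(&top1, &sub1); c; c = c->next)
        printf("replay change at lsn %lu\n", c->lsn);
    return 0;
}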

All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk. The way this currently works is
hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk. When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.
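
For reference, the current mechanism boils down to roughly the
following check, run after a change has been queued (paraphrased from
memory from reorderbuffer.c, so treat the details as approximate):

static const Size max_changes_in_memory = 4096;

static void
ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    /*
     * Too many changes in memory for this (sub)transaction: write them
     * out to its spill files; they are read back at commit time.
     */
    if (txn->nentries_mem >= max_changes_in_memory)
        ReorderBufferSerializeTXN(rb, txn);
}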

This all works mostly fine, but you can construct some more extreme
cases where this can blow up.

Here is a mundane example. Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example). If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, or triggers, or are doing batch
updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender). And so on. If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB. And if you don't have those 8 GB, then you're stuck at this point.

That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time. That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).
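
Spelling the arithmetic out with the example numbers (100-byte changes,
100 sessions, 200 subtransactions; the commit-time line just plugs the
same example numbers into the formula above):

#include <stdio.h>

int
main(void)
{
    const double threshold = 4096;  /* hardcoded changes per (sub)xact */
    const double change = 100;      /* assumed bytes per change */
    const double sessions = 100;    /* concurrent active sessions */
    const double subxacts = 200;    /* subxacts per session */

    /* recording, no subxacts: ~40 MB per walsender */
    printf("%.1f MB\n", threshold * change * sessions / 1e6);

    /* recording, 200 subxacts per session: ~8 GB per walsender */
    printf("%.1f GB\n", threshold * change * sessions * subxacts / 1e9);

    /* commit time, one top-level xact at a time: ~80 MB */
    printf("%.1f MB\n", threshold * change * subxacts / 1e6);

    return 0;
}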

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.

- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096. The configuration setting is also in
terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding
session, not per subtransaction. (So we also keep a running tally for
the entire ReorderBuffer.)
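
In code, the proposed accounting amounts to something like the sketch
below.  The names, fields, and the limit value are all illustrative
(the actual patch may well differ); the point is that every change
updates both a per-transaction and a buffer-wide tally, and the
configured limit is checked against the latter.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* illustrative stand-ins for the new accounting fields */
typedef struct TxnAcct { size_t size; } TxnAcct;    /* per (sub)xact */
typedef struct BufAcct { size_t size; } BufAcct;    /* whole session */

/* proposed GUC (value here is just an example): memory per session */
static size_t logical_work_mem = 64UL * 1024 * 1024;

/*
 * Call with add=true when a change is queued, add=false when it is
 * freed or spilled.  Returns true when the session-wide limit has been
 * exceeded and the caller should pick transactions to evict (a
 * possible eviction strategy is sketched further down).
 */
static bool
update_accounting(BufAcct *rb, TxnAcct *txn, size_t change_size,
                  bool add)
{
    if (add)
    {
        txn->size += change_size;
        rb->size += change_size;
    }
    else
    {
        assert(txn->size >= change_size && rb->size >= change_size);
        txn->size -= change_size;
        rb->size -= change_size;
    }

    return rb->size > logical_work_mem;
}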

There are two open issues with this patch:

One, this mechanism only applies when recording changes. The processing
at commit time still uses the previous hardcoded mechanism. The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing. There are some
proposals to change this as well, but they are more involved. Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.

Two, what to do when the memory limit is reached. With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk once it reached its 4096-change
limit. Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way. The proposed patch searches
through the entire list of transactions to find the largest one. But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive. Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.
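
For the sake of having something concrete to argue about, a combination
of (b) and (c) could look roughly like the sketch below (names are made
up; spill_to_disk() stands in for the existing serialization code):

#include <stddef.h>

typedef struct Txn
{
    struct Txn *next;
    size_t      size;       /* running tally of this xact's changes */
} Txn;

/* stand-in for serializing a transaction's changes to disk */
static void
spill_to_disk(Txn *txn, size_t *total)
{
    *total -= txn->size;
    txn->size = 0;
}

/*
 * One pass over all transactions: spill any transaction using at least
 * 10% of the limit (option b), and stop once usage has dropped to half
 * of the limit (option c).  A real implementation would need a fallback
 * for the case where no single transaction crosses the 10% threshold.
 */
static void
enforce_limit(Txn *txns, size_t *total, size_t limit)
{
    Txn *t;

    for (t = txns; t != NULL && *total > limit / 2; t = t->next)
    {
        if (t->size >= limit / 10)
            spill_to_disk(t, total);
    }
}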

Thoughts?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
