Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2018-03-29 16:34:58
Message-ID: f5c07663-3846-cc07-3f79-620db4eeff24@postgrespro.ru
Lists: pgsql-hackers

On 11.01.2018 22:41, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here. This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
>
> The data in the WAL is written as it happens, so the changes belonging
> to different transactions are all mixed together. One of the jobs of
> logical decoding is to reassemble the changes belonging to each
> transaction. The top-level data structure for that is the infamous
> ReorderBuffer. So as it reads the WAL and sees something about a
> transaction, it keeps a copy of that change in memory, indexed by
> transaction ID (ReorderBufferChange). When the transaction commits, the
> accumulated changes are passed to the output plugin and then freed. If
> the transaction aborts, then changes are just thrown away.
>
> So when logical decoding is active, a copy of the changes for each
> active transaction is kept in memory (once per walsender).
>
> More precisely, the above happens for each subtransaction. When the
> top-level transaction commits, it finds all its subtransactions in the
> ReorderBuffer, reassembles everything in the right order, then invokes
> the output plugin.
>
> All this could end up using an unbounded amount of memory, so there is a
> mechanism to spill changes to disk. The way this currently works is
> hardcoded, and this patch proposes to change that.
>
> Currently, when a transaction or subtransaction has accumulated 4096
> changes, it is spilled to disk. When the top-level transaction commits,
> things are read back from disk to do the final processing mentioned above.
>
> This all works mostly fine, but you can construct some more extreme
> cases where this can blow up.
>
> Here is a mundane example. Let's say a change entry takes 100 bytes (it
> might contain a new row, or an update key and some new column values,
> for example). If you have 100 concurrent active sessions and no
> subtransactions, then logical decoding memory is bounded by 4096 * 100 *
> 100 = 40 MB (per walsender) before things spill to disk.
>
> Now let's say you are using a lot of subtransactions, because you are
> using PL functions, exception handling, triggers, doing batch updates.
> If you have 200 subtransactions on average per concurrent session, the
> memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
> (per walsender). And so on. If you have more concurrent sessions or
> larger changes or more subtransactions, you'll use much more than those
> 8 GB. And if you don't have those 8 GB, then you're stuck at this point.
>
> That is the consideration when we record changes, but we also need
> memory when we do the final processing at commit time. That is slightly
> less problematic because we only process one top-level transaction at a
> time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
> (without the concurrent sessions factor).
>
> So, this patch proposes to improve this as follows:
>
> - We compute the actual size of each ReorderBufferChange and keep a
> running tally for each transaction, instead of just counting the number
> of changes.
>
> - We have a configuration setting that allows us to change the limit
> instead of the hardcoded 4096. The configuration setting is also in
> terms of memory, not in number of changes.
>
> - The configuration setting is for the total memory usage per decoding
> session, not per subtransaction. (So we also keep a running tally for
> the entire ReorderBuffer.)
>
> There are two open issues with this patch:
>
> One, this mechanism only applies when recording changes. The processing
> at commit time still uses the previous hardcoded mechanism. The reason
> for this is, AFAIU, that as things currently work, you have to have all
> subtransactions in memory to do the final processing. There are some
> proposals to change this as well, but they are more involved. Arguably,
> per my explanation above, memory use at commit time is less likely to be
> a problem.
>
> Two, what to do when the memory limit is reached. With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit. Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way. The proposed patch searches
> through the entire list of transactions to find the largest one. But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
>
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive. Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
>
> Thoughts?
>

I am very sorry that I did not notice this thread before.
Spilling to disk in the reorder buffer is the main factor limiting the
speed of importing data in multimaster and shardman (sharding based on
FDW, with redundancy provided by LR).
This is why we have thought a lot about possible ways of addressing this
issue.
Right now the data of a huge transaction is written to disk three times
before it is applied at the replica, and obviously read three times as
well. First it is saved in the WAL, then spilled to disk by the reorder
buffer, and finally spilled to disk once more at the replica before
being assigned to a particular apply worker
(the last step is specific to multimaster, which can apply received
transactions concurrently).

We considered three different approaches:
1. Streaming. This is similar to the proposed patch; the main difference
is that we do not want to spill the transaction to a temporary file at
the replica, but instead apply it immediately in a separate backend and
abort that transaction if it is aborted at the master. Certainly this
will work only with 2PC. (A rough sketch of this scheme is given below.)
2. Eliminating spilling by rescanning the WAL.
3. Bypassing the WAL: add hooks to heapam to buffer changes, propagate
them immediately to the replica and apply them in a dedicated backend.
I have implemented a prototype of such replication. With one replica it
shows about a 1.5x slowdown compared with standalone/async LR and about
a 2-3x improvement compared with sync LR. For two replicas the result is
2x slower than async LR and 2-8 times faster than sync LR (depending on
the number of concurrent connections).
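
Just to make approach 1 concrete, here is a rough runnable sketch in
Python (all names are invented for illustration; this is not our actual
multimaster code). Each in-progress transaction streamed from the master
is applied immediately by a dedicated worker at the replica; when the
master reaches the end of the transaction the worker prepares it, and if
the master aborts, the worker simply rolls back:

class ApplyWorker:
    """Stands in for a separate backend holding an open transaction."""
    def __init__(self, xid):
        self.xid = xid
        self.applied = []          # changes already applied, not yet committed

    def apply(self, change):
        self.applied.append(change)        # in reality: execute the change

    def prepare(self, gid):
        print(f"worker {self.xid}: PREPARE TRANSACTION '{gid}' "
              f"({len(self.applied)} changes)")

    def abort(self):
        print(f"worker {self.xid}: ROLLBACK ({len(self.applied)} changes discarded)")


def replica_consume(stream):
    workers = {}
    for msg in stream:
        kind, xid = msg[0], msg[1]
        if kind == "change":
            # The first change of an unseen xid starts a dedicated worker.
            if xid not in workers:
                workers[xid] = ApplyWorker(xid)
            workers[xid].apply(msg[2])
        elif kind == "prepare":       # the master reached the transaction end
            workers.pop(xid).prepare(msg[2])
        elif kind == "abort":         # the master aborted the streamed xact
            workers.pop(xid).abort()


# One large transaction streamed change by change, and one that is aborted.
replica_consume([
    ("change", 100, "INSERT ..."), ("change", 101, "UPDATE ..."),
    ("change", 100, "INSERT ..."), ("abort", 101),
    ("change", 100, "INSERT ..."), ("prepare", 100, "gid-100"),
])

The point is that nothing is spilled to a file at the replica; an aborted
transaction just costs a rollback of the already-applied changes.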

Approach 3 seems to be specific to multimaster/shardman, so most likely
it cannot be considered for general LR.
So I want to compare 1 and 2. Did you ever think about something like 2?

Right now the proposed patch just moves the spilling to a file from the
master to the replica.
It can still make sense, to avoid memory overflow and to reduce disk I/O
at the master.
But if we have just one huge transaction (COPY) importing gigabytes of
data into the database, then performance will be almost the same with or
without your patch. The only difference is where we serialize the
transaction: at the master or at the replica side.
In this sense the patch doesn't solve the problem of slow loading of
large amounts of data through LR.

Alternatively (approach 2) we can keep a small in-memory buffer for the
transaction being decoded and remember the LSN and snapshot of this
transaction's start.
In case of buffer overflow we just continue the WAL traversal until we
reach the end of the transaction. After that we restart scanning the WAL
from the beginning of this transaction, and on this second pass send the
changes directly to the output plugin. So we may have to scan the WAL
several times, but we do not need to spill anything to disk, neither at
the publisher nor at the subscriber side. (A rough sketch of this
two-pass scheme is given below.)
Certainly this approach will be inefficient if we have several long
interleaved transactions. But in most customer use cases we have
observed so far, there is just one huge transaction performing a bulk
load.
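
Here is a rough runnable sketch of that two-pass scheme in Python
(hypothetical names and a toy "WAL" made of tuples, just to make the
control flow explicit; this is obviously not how it would look inside
the reorder buffer code). The first pass buffers changes in memory and,
on overflow, only remembers where the transaction started; at commit of
an overflowed transaction its WAL range is rescanned and the changes go
straight to the output plugin:

BUFFER_LIMIT = 3      # max changes buffered per transaction (tiny, for the demo)

def decode(wal, send):
    buffered = {}       # xid -> changes kept in memory
    start_lsn = {}      # xid -> LSN of the transaction's first change
    overflowed = set()  # xids whose buffered changes were dropped

    for lsn, xid, change in wal:
        if change == "COMMIT":
            if xid in overflowed:
                # Second pass: rescan only this transaction's WAL range and
                # send its changes directly to the output plugin.
                for lsn2, xid2, change2 in wal:
                    if xid2 == xid and start_lsn[xid] <= lsn2 < lsn:
                        send(xid, change2)
            else:
                for c in buffered.get(xid, []):
                    send(xid, c)
            send(xid, "COMMIT")
            buffered.pop(xid, None)
            overflowed.discard(xid)
        else:
            start_lsn.setdefault(xid, lsn)
            if xid in overflowed:
                continue                     # will be re-read on the second pass
            buffered.setdefault(xid, []).append(change)
            if len(buffered[xid]) > BUFFER_LIMIT:
                buffered.pop(xid)            # drop the copies: nothing goes to disk
                overflowed.add(xid)

# A huge COPY (xid 7) interleaved with a small transaction (xid 8).
wal = [(i, 7, "copy row %d" % i) for i in range(5)]
wal += [(5, 8, "small insert"), (6, 8, "COMMIT"), (7, 7, "COMMIT")]
decode(wal, lambda xid, c: print("xid %s: %s" % (xid, c)))

With only one huge transaction in flight, the extra cost is one
additional read of its WAL range instead of a full serialize/deserialize
cycle through temporary files.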
Maybe I have missed something, but this approach seems to be easier to
implement than transaction streaming, and it doesn't require any changes
to the output plugin API.
I realize that it is a little bit late to ask this question when your
patch is almost ready, but what do you think about it? Are there any
pitfalls in this approach?

There is one more aspect and performance problem with LR that we have
faced with shardman: if there are several publications for different
subsets of tables at one instance,
then the WAL senders have to do a lot of useless work. They decode
transactions which have no relation to their publication, but a WAL
sender doesn't know this until it reaches the end of the transaction.
What is worse: if the transaction is huge, then all WAL senders will
spill it to disk even though only one of them actually needs it. So the
data of a huge transaction is written not three times, but N times,
where N is the number of publications. The only solution to this problem
we can imagine is to let the backend somehow inform the WAL senders
(through a shared message queue?)
about the LSNs they should consider. In this case a WAL sender could
skip large portions of WAL without decoding them. (A very rough sketch
of that kind of filtering is given below.) We also want to know the
opinion of 2ndQuadrant about this idea.
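
To illustrate what we have in mind (purely hypothetical, a toy Python
sketch; the actual mechanism, the shared queue layout and so on are all
open questions): backends would record which publications a transaction
is actually relevant to, and each WAL sender would consult those hints
and only decode the transactions that can matter for its publication:

# Filled in by backends, e.g. through a shared message queue (mechanism TBD):
# which publications each transaction is actually relevant to.
relevant = {
    1001: {"pub_orders"},
    1002: {"pub_orders", "pub_items"},
}

def walsender_should_decode(publication, xid):
    pubs = relevant.get(xid)
    # Transactions without a hint must still be decoded.
    return pubs is None or publication in pubs

for xid in (1001, 1002, 1003):
    for pub in ("pub_orders", "pub_items"):
        action = "decode" if walsender_should_decode(pub, xid) else "skip"
        print("walsender for %s: %s xid %d" % (pub, action, xid))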

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
