PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2017-12-23 04:57:43
Message-ID: 688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com

Hi all,

Attached is a patch series that adds two features to logical
replication - the ability to define a memory limit for the reorderbuffer
(responsible for assembling the decoded transactions), and the ability
to stream large in-progress transactions (those exceeding the memory
limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.

PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it can't be customized.

* The amount of decoded changes kept in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The limit is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory
without actually hitting it.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting the memory consumed by all transactions currently kept
in the reorder buffer.

Secondly, it adds simple memory accounting, tracking the amount of
memory used both in total (for the whole reorder buffer, to compare
against logical_work_mem) and per transaction (so that we can quickly
pick a transaction to spill to disk).
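To illustrate the accounting scheme, here is a minimal sketch using
hypothetical names (TxnEntry, track_change, pick_txn_to_spill are
invented for illustration; the actual patch works on ReorderBuffer
structures):

```c
#include <stddef.h>

#define MAX_TXNS 16

/* Sketch only: one entry per transaction currently in the buffer. */
typedef struct TxnEntry
{
	int		xid;
	size_t	size;		/* memory used by this transaction's changes */
} TxnEntry;

typedef struct Buffer
{
	TxnEntry txns[MAX_TXNS];
	int		ntxns;
	size_t	total;		/* compared against logical_work_mem */
	size_t	limit;		/* the logical_work_mem value */
} Buffer;

/* Account for a decoded change belonging to transaction "xid". */
static void
track_change(Buffer *buf, int xid, size_t change_size)
{
	int		i;

	for (i = 0; i < buf->ntxns; i++)
		if (buf->txns[i].xid == xid)
			break;

	if (i == buf->ntxns)
	{
		buf->txns[i].xid = xid;
		buf->txns[i].size = 0;
		buf->ntxns++;
	}

	/* update both counters in one place */
	buf->txns[i].size += change_size;
	buf->total += change_size;
}

/*
 * The per-transaction counters make it cheap to pick the largest
 * transaction once the total exceeds the limit. Returns -1 while
 * under the limit.
 */
static int
pick_txn_to_spill(Buffer *buf)
{
	int		best = -1;
	size_t	best_size = 0;

	if (buf->total <= buf->limit)
		return -1;

	for (int i = 0; i < buf->ntxns; i++)
		if (buf->txns[i].size > best_size)
		{
			best = i;
			best_size = buf->txns[i].size;
		}

	return buf->txns[best].xid;
}
```

The point of the two-level accounting is exactly this split: the total
answers "are we over the limit?", the per-transaction sizes answer
"which transaction do we evict?".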

The one wrinkle in the patch is that the memory limit can't be enforced
while reading changes spilled to disk - with multiple subtransactions,
we can't easily predict how many changes to pre-read for each of them.
At that point we still use the existing max_changes_in_memory limit.

Luckily, changes introduced in the other parts of the patch should allow
addressing this deficiency.

PART 2: streaming of large in-progress transactions (0002-0006)
---------------------------------------------------------------

Note: This part is split into multiple smaller chunks addressing
different parts of the logical decoding infrastructure, mostly to make
review easier. Ultimately, it's just one patch.

Processing large transactions often results in significant apply lag.
One major reason is network transfer - while we do decode the changes
incrementally (as we read the WAL), we keep them locally, either in
memory or spilled to files. Then at commit time, all the changes get
sent to the downstream (and applied) at once. For large transactions
the network transfer may take a significant amount of time, causing
apply lag.

This patch extends the logical replication infrastructure (output plugin
API, reorder buffer, pgoutput, replication protocol etc.) to allow
streaming of in-progress transactions instead of spilling them to local
files.

The extensions to the API are pretty straightforward. Aside from adding
methods to stream changes/messages and commit a streamed transaction,
the API needs a function to abort a streamed (sub)transaction, and
functions to demarcate a block of streamed changes.
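The shape of those additions can be sketched roughly as follows. The
struct and all names here are hypothetical, invented for illustration -
the actual callbacks are defined in the 0004 patch:

```c
#include <stddef.h>

/* Sketch of the additional output plugin callbacks (names invented). */
typedef struct StreamCallbacks
{
	/* demarcate a block of streamed changes for one transaction */
	void		(*stream_start) (void *ctx, int xid);
	void		(*stream_stop) (void *ctx, int xid);

	/* stream individual changes / messages */
	void		(*stream_change) (void *ctx, int xid, const void *change);
	void		(*stream_message) (void *ctx, int xid, const char *msg);

	/* abort a previously streamed (sub)transaction */
	void		(*stream_abort) (void *ctx, int xid, int subxid);

	/* commit a previously streamed transaction */
	void		(*stream_commit) (void *ctx, int xid);
} StreamCallbacks;

/* Demo callbacks, counting invocations only. */
static int	ncalls = 0;

static void cb_start(void *ctx, int xid) { (void) ctx; (void) xid; ncalls++; }
static void cb_stop(void *ctx, int xid) { (void) ctx; (void) xid; ncalls++; }
static void cb_change(void *ctx, int xid, const void *c)
{ (void) ctx; (void) xid; (void) c; ncalls++; }
static void cb_commit(void *ctx, int xid) { (void) ctx; (void) xid; ncalls++; }

/*
 * Demo driver: how a decoder might emit one block of streamed changes
 * for an in-progress transaction (commit arrives separately, later).
 */
static void
stream_block(const StreamCallbacks *cb, void *ctx, int xid, int nchanges)
{
	cb->stream_start(ctx, xid);
	for (int i = 0; i < nchanges; i++)
		cb->stream_change(ctx, xid, NULL);
	cb->stream_stop(ctx, xid);
}
```

The start/stop pair is what lets the downstream interleave blocks from
multiple concurrent transactions on a single connection.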

To decode a transaction, we need to know all its subtransactions and
invalidations. Currently, those are only known at commit time (although
some assignments may be known earlier), and invalidations are only ever
written in the commit record.

So far that was fine, because we only decode/replay transactions at
commit time, when all of this is known (because it's either in commit
record, or written before it).

But for in-progress transactions (i.e. the subject of interest here),
that is not the case. So the patch modifies WAL-logging to ensure those
two bits of information are written immediately (for wal_level=logical).

For assignments that was fairly simple, thanks to existing caching. For
invalidations, it requires a new WAL record type and a couple of changes
in inval.c.

On the apply side, we simply receive the streamed changes and write them
into a file (a single file per toplevel transaction, which is possible
thanks to the assignments being known immediately). Then, at commit
time, the changes are replayed locally, without having to copy a large
chunk of data over the network.
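The apply-side bookkeeping can be sketched like this, with an in-memory
counter standing in for the per-transaction file (all names here are
hypothetical, for illustration only):

```c
#define MAX_XID 64

/*
 * Sketch: because subxact assignments arrive before the changes, every
 * streamed change can be routed to its toplevel transaction's file.
 * Here nchanges[] stands in for the file contents.
 */
static int	assignment[MAX_XID];	/* subxid -> toplevel xid (0 = itself) */
static int	nchanges[MAX_XID];		/* changes buffered per toplevel xid */

static void
handle_assignment(int subxid, int toplevel)
{
	assignment[subxid] = toplevel;
}

static int
toplevel_of(int xid)
{
	return assignment[xid] ? assignment[xid] : xid;
}

/* A streamed change is appended to its toplevel transaction's file. */
static void
handle_stream_change(int xid)
{
	nchanges[toplevel_of(xid)]++;
}

/* At commit, replay everything buffered for the toplevel transaction. */
static int
handle_stream_commit(int toplevel)
{
	int		n = nchanges[toplevel];

	nchanges[toplevel] = 0;
	return n;					/* number of changes replayed locally */
}
```

Keeping one file per toplevel transaction (rather than one per subxact)
is what the immediate assignment logging buys us on the apply side.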

WAL overhead
------------

Of course, these changes to WAL logging are not for free - logging
assignments individually (instead of multiple subtransactions at once)
means higher xlog record overhead. Similarly, (sub)transactions doing a
lot of DDL may result in a lot of invalidations written to WAL (again,
with full xlog record overhead per invalidation).

I've done a number of tests to measure the impact, and for extreme
corner cases the additional amount of WAL is about 40% in both cases.

By an "extreme corner case" I mean a workload intentionally triggering
many assignments/invalidations without doing a lot of meaningful work.

For assignments, imagine a single-row table (no indexes), and a
transaction like this one:

BEGIN;
UPDATE t SET v = v + 1;
SAVEPOINT s1;
UPDATE t SET v = v + 1;
SAVEPOINT s2;
UPDATE t SET v = v + 1;
SAVEPOINT s3;
...
UPDATE t SET v = v + 1;
SAVEPOINT s10;
UPDATE t SET v = v + 1;
COMMIT;

For invalidations, add a CREATE TEMPORARY TABLE to each subtransaction.

For more realistic workloads (large table with indexes, runs long enough
to generate FPIs, etc.) the overhead drops below 5%, which is much more
acceptable, of course, although not perfect.

In both cases, there was pretty much no measurable impact on performance
(as measured by tps).

I do not think there's a way around this requirement (having assignments
and invalidations), if we want to decode in-progress transactions. But
perhaps it would be possible to do some sort of caching (say, at command
level), to reduce the xlog record overhead? Not sure.

All ideas are welcome, of course. In the worst case, I think we can add
a GUC enabling this additional logging - when disabled, streaming of
in-progress transactions would not be possible.

Simplifying ReorderBuffer
-------------------------

One interesting consequence of having the assignments is that we could
get rid of the ReorderBuffer iterator used to merge changes from
subxacts. The assignments allow us to keep the changes for each toplevel
transaction in a single list, in LSN order, and simply walk it. An abort
can be handled by remembering the position of the first change in each
subxact, and just discarding the tail.

This is what the apply worker does with the streamed changes and aborts.

It would also allow us to enforce the memory limit while restoring
transactions spilled to disk, because we would not have the problem with
restoring changes for many subtransactions.
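The abort-by-truncation idea is simple enough to sketch directly (again
with hypothetical names; the real structure is the ReorderBuffer change
list):

```c
#define MAX_CHANGES 100

/*
 * Sketch: all changes of a toplevel transaction in one array, in LSN
 * order. Everything logged after a subxact's first change is newer in
 * LSN order and nested under it, so aborting the subxact is just a
 * truncation - no iterator/merge needed.
 */
typedef struct TxnChanges
{
	int		xids[MAX_CHANGES];	/* which (sub)xact produced each change */
	int		nchanges;
} TxnChanges;

static void
add_change(TxnChanges *txn, int xid)
{
	txn->xids[txn->nchanges++] = xid;
}

/* Remember this when the subxact starts (index of its first change). */
static int
subxact_start(TxnChanges *txn)
{
	return txn->nchanges;
}

/* ROLLBACK TO SAVEPOINT / subxact abort: drop the tail. */
static void
abort_subxact(TxnChanges *txn, int first_change)
{
	txn->nchanges = first_change;
}
```

This is also why the same scheme would let us enforce the memory limit
while restoring spilled transactions: a single LSN-ordered list per
toplevel transaction has no per-subxact pre-read problem.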

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-me.patch.gz application/gzip 6.8 KB
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical.patch.gz application/gzip 2.7 KB
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz application/gzip 4.8 KB
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz application/gzip 5.3 KB
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz application/gzip 10.6 KB
0006-Add-support-for-streaming-to-built-in-replication.patch.gz application/gzip 13.3 KB
