Re: logical decoding and replication of sequences

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Hannu Krosing <hannuk(at)google(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>
Subject: Re: logical decoding and replication of sequences
Date: 2021-11-19 19:54:46
Message-ID: e6ef8862-31e4-1582-9274-f8f57b7f1cab@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I've spent a bit of time exploring the alternative approach outlined by
Hannu, i.e. tracking sequences accessed by the transaction, and logging
the final state just once at COMMIT. Attached is an experimental version
of the patch series doing that - 0001 does the original approach
(decoding the sequence updates from WAL) and then 0002 reworks it to
this alternative solution. The 0003 and 0004 stay mostly the same,
except for minor fixes. Some of the tests in 0003/0004 fail, because
0002 changes the semantics in various ways (more about that later).

The original approach (0001) may seem complex at first, but in principle
it just decodes changes to the sequence relation, and either stashes
them into transaction (just like other changes) or applies them right
away. I'd say that's the most complicated part - deciding whether the
change is transactional or not.

0002 reworks that so that it doesn't decode the existing WAL records,
but tracks sequences which have been modified (updated on-disk state)
and then accessed in the current transaction. And then at COMMIT time we
write a new WAL message with info about the sequence.

I realized we already cache sequences for each session - seqhashtab in
sequence.c. It doesn't have any concept of a transaction, but it seems
fairly easy to make that possible. I did this by adding two flags

- needs_log - means the seesion advanced the sequence (on disk)
- touched - true if the current xact called nextval() etc.

The idea is that what matters is updates to on-disk state, so whenever
we do that we set needs_log. But it only matters when the changes are
made visible in a committed transaction. Consider for example this:

BEGIN;
SELECT nextval('s') FROM generate_series(1,10000) s(i);
ROLLBACK;
SELECT nextval('s');

The first nextval() call certainly sets both flags to true, at least for
default sequences caching 32 values. But the values are not confirmed to
the user because of the rollback - this resets 'touched' flag, but
leaves 'needs_log' set to true.

And then the next nextval() - which may easily be just from cache - sets
touched=true again, and logs the sequence state at (implicit) commit.
Which resets both flags again.

The logging/cleanup happens in AtEOXact_Sequences() which gets called
before commit/abort. This walks all cached sequences and writes the
state for those with both flags true (or resets flag for abort).

The cache also keeps info about the last "sequence state" in the
session, which is then used when writing into into WAL.

To write the sequence state into WAL, I've added a new WAL record
xl_logical_sequence to RM_LOGICALMSG_ID, next to the xl_logical_message.
It's a bit arbitrary, maybe it should be part of RM_SEQ_ID, but it does
the trick. I don't think this is the main issue and it's easy enough to
move it elsewhere if needed.

So, that seems fairly straight-forward and it may reduce the number of
replication messages for large transactions. Unfortunately, it's not
much simpler compared to the first approach - the amount of code is
about the same, and there's a bunch of other issues.

The main issue seems to be about ordering. Consider multiple sessions
all advancing the sequence. With the "old" approach this was naturally
ordered - the order in which the increments were written to WAL made
sense. But the sessions may advance the sequences in one order and then
commit in a different order, which mixes the updates. Consider for
example this scenario with two concurrent transactions:

T1: nextval('s') -> allocates values [1,32]
T2: nextval('s') -> allocates values [33,64]
T2: commit -> logs [33,64]
T1: commit -> logs [1,32]

The result is the sequence on the replica diverted because it replayed
the increments in the opposite order.

I can think of two ways to fix this. Firstly, we could "merge" the
increments in some smart way, e.g. by discarding values considered
"stale" (like decrements). But that seems pretty fragile, because the
sequence may be altered in various ways, reset, etc. And it seems more
like transferring responsibility to someone else instead of actually
solving the issue.

The other fix is simply reading the current sequence state from disk at
commit and logging that (instead of the values cached from the last
increment). But I'm rather skeptical about doing such things right
before COMMIT.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0004-Add-support-for-decoding-sequences-to-built-20211118.patch text/x-patch 77.1 KB
0003-Add-support-for-decoding-sequences-to-test_-20211118.patch text/x-patch 20.7 KB
0002-rework-20211118.patch text/x-patch 35.4 KB
0001-Logical-decoding-of-sequences-20211118.patch text/x-patch 43.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2021-11-19 19:59:12 Re: xlog.c: removing ReadRecPtr and EndRecPtr
Previous Message Alvaro Herrera 2021-11-19 19:50:54 Re: Why not try for a HOT update, even when PageIsFull()?