logical decoding and replication of sequences

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>
Subject: logical decoding and replication of sequences
Date: 2021-06-07 22:28:22
Message-ID: d045f3c2-6cfb-06d3-5540-e63c320df8bc@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi,

One of the existing limitations of logical decoding / replication is
that it does no care about sequences. The annoying consequence is that
after a failover to logical replica, all the table data may be
replicated but the sequences are still at the initial values, requiring
some custom solution that moves the sequences forward enough to prevent
duplicities.

There have been attempts to address this in the past, most recently [1],
but none of them got in due to various issues.

This is an attempt, based on [1] (but with many significant parts added
or reworked), aiming to deal with this. The primary purpose of sharing
it is getting feedback and opinions on the design decisions. It's still
a WIP - it works fine AFAICS, but some of the bits may be a bit hackish.

The overall goal is to have the same sequence data on the primary and
logical replica, or something sufficiently close to that, so that the
replica after a failover does not generate duplicate values.

This patch does a couple basic things:

1) extends the logical decoding to handle sequences. It adds a new
callback, similarly to what we have for messages. There's a bit of
complexity with transactional and non-transactional behavior, more
about that later

2) extends test_decoding to support this new callback, printing the
sequence increments (the decoded WAL records)

3) extends built-in replication to support sequences, so publications
may contain both tables and sequences, etc., sequences data sync
when creating subscriptions, etc.

transactional vs. non-transactional
-----------------------------------

The first part (extending logical decoding) is simple in principle. We
simply decode the sequence updates, but then comes a challenge - should
we just treat it transactionally and stash it in reorder buffer, or
just pass it to the output plugin right-away?

For messages, this can be specified as a flag when adding the message,
so the user can decide depending on the message purpose. For sequences,
all we do is nextval() and it depends on the context in which it's used,
we can't just pick one of those approaches.

Consider this, for example:

CREATE SEQUENCE s;
BEGIN;
SELECT nextval('s') FROM generate_series(1,1000) s(i);
ROLLBACK;

If we handle this "transactionally", we'd stash the "nextval" increment
into the transaction, and then discard it due to the rollback, so the
output plugin (and replica) would never get it. So this is an argument
for non-transactional behavior.

On the other hand, consider this:

CREATE SEQUENCE s;
BEGIN;
ALTER SEQUENCE s RESTART WITH 2000;
SELECT nextval('s') FROM generate_series(1,1000) s(i);
ROLLBACK;

In this case the ALTER creates a new relfilenode, and the ROLLBACK does
discard it including the effects of the nextval calls. So here we should
treat it transactionally, stash the increment(s) in the transaction and
just discard it all on rollback.

A somewhat similar example is this

BEGIN;
CREATE SEQUENCE s;
SELECT nextval('s') FROM generate_series(1,1000) s(i);
COMMIT;

Again - the decoded nextval needs to be handled transactionally, because
otherwise it's going to be very difficult for custom plugins to combine
this with DDL replication.

So the patch does a fairly simple thing:

1) By default, sequences are treated non-transactionally, i.e. sent to
the output plugin right away.

2) We track sequences created in running (sub)transactions, and those
are handled transactionally. This includes ALTER SEQUENCE cases,
which create a new relfilenode, which is used as an identifier.

It's a bit more complex, because of cases like this:

BEGIN;
CREATE SEQUENCE s;
SAVEPOINT a;
SELECT nextval('s') FROM generate_series(1,1000) s(i);
ROLLBACK TO a;
COMMIT;

because we must not discard the nextval changes - this is handled by
always stashing the nextval changes to the subxact where the sequence
relfilenode was created.

The tracking is a bit cumbersome - there's a hash table with relfilenode
mapped to XID in which it was created. AFAIK that works, but might be
an issue with many sequences created in running transactions. Not sure.

detecting sequence creation
---------------------------

Detection that a sequence (or rather the relfilenode) was created is
done by adding a "created" flag into the xl_seq_rec, and setting it to
"true" in the first WAL record after the creation. There might be some
other way, but this seemed simple enough.

applying the sequence (ResetSequence2)
--------------------------------------

The decoding pretty much just extracts log_value, log_cnt and is_called
from the sequence, and passes them to the output plugin. On the replica
we extract those from the message, and write them to the local sequence
using a new ResetSequence2 function.

It's possible we don't really need log_cnt and is_called. After all,
log_cnt is zero most of the time anyway, and the worst thing that could
happen if we ignore it is we skip a couple values (which seems fine).

syncing sequences in a subscription
-----------------------------------

After creating a subscription, the sequences get syncronized just like
tables. This part ia a bit hacked together, and there's definitely room
for improvement - e.g. a new bgworker is started for each sequence, as
we simply treat both tabels and sequences as "relation". But all we need
to do for sequences is copying the (last_value, log_cnt, is_called) and
calling ResetSequence2, so maybe we could sync all sequences in a single
worker, or something like that.

new "sequence" publication action
---------------------------------

The publications now have a new "sequence" publication action, which is
enabled by default. This determines whether the publication decodes
sequences or what.

FOR ALL SEQUENCES
-----------------

It should be possible to create FOR ALL SEQUENCES publications, just
like we have FOR ALL TABLES. But this produces shift/reduce conflicts
in the grammar, and I didn't bother dealing with that. So for now it's
required to do ALTER PUBLICATION ... [ADD | DROP] SEQUENCE ...

no streaming support yet
------------------------

There's no supoprt for streaming of in-progress transactions yet, but
should be trivial to add.

GetCurrentTransactionId() in nextval
------------------------------------

There's a bit annoying behavior of nextval() - if you do this:

BEGIN;
CREATE SEQUENCE s;
SAVEPOINT a;
SELECT nextval('s') FROM generate_series(1,100) s(i);
COMMIT;

then the WAL record for nextval (right after the savepoint) will have
XID 0 (easy to see in pg_waldump). That's kinda strange, and it causes
problems in DecodeSequence() when calling

SnapBuildProcessChange(builder, xid, buf->origptr)

for transactional changes, because that expects a valid XID. Fixing
this required adding GetCurrentTransactionId() to nextval() and two
other functions, which were only doing

if (RelationNeedsWAL(seqrel))
GetTopTransactionId();

so far. I'm not sure if this has some particularly bad consequences.

regards

[1]
https://www.postgresql.org/message-id/flat/1710ed7e13b.cd7177461430746.3372264562543607781%40highgo.ca

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
sequence-decoding-20210608.patch text/x-patch 113.9 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-06-07 23:10:00 Re: CALL versus procedures with output-only arguments
Previous Message Tom Lane 2021-06-07 22:15:49 Re: Add PortalDrop in exec_execute_message