Re: logical decoding and replication of sequences, take 2

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject: Re: logical decoding and replication of sequences, take 2
Date: 2023-12-12 10:01:38
Message-ID: 12822961-b7de-9d59-dd27-2e3dc3980c7e@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

There's been a lot discussed over the past month or so, and it's become
difficult to get a good idea what's the current state - what issues
remain to be solved, what's unrelated to this patch, and how to move if
forward. Long-running threads tend to be confusing, so I had a short
call with Amit to discuss the current state yesterday, and to make sure
we're on the same page. I believe it was very helpful, and I've promised
to post a short summary of the call - issues, what we agreed seems like
a path forward, etc.

Obviously, I might have misunderstood something, in which case Amit can
correct me. And I'd certainly welcome opinions from others.

In general, we discussed three areas - desirability of the feature,
correctness and performance. I believe a brief summary of the agreement
would be this:

- desirability of the feature: Random IDs (UUIDs etc.) are likely a much
better solution for distributed (esp. active-active) systems. But there
are important use cases that are likely to keep using regular sequences
(online upgrades of single-node instances, existing systems, ...).

- correctness: There's one possible correctness issue, when the snapshot
changes to FULL between record creating a sequence relfilenode and that
sequence advancing. This needs to be verified/reproduced, and fixed.

- performance issues: We've agreed the case with a lot of aborts (when
DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've
discussed whether the overhead with many sequence changes (nextval-40)
is acceptable, and/or how to improve it.

Next, I'll go over these points in more details, with my understanding
of what the challenges are, possible solutions etc. Most of this was
discussed/agreed on the call, but some are ideas I had only after the
call when writing this summary.

1) desirability of the feature

Firstly, do we actually want/need this feature? I believe that's very
much a question of what use cases we're targeting.

If we only focus on distributed databases (particularly those with
multiple active nodes), then we probably agree that the right solution
is to not use sequences (~generators of incrementing values) but UUIDs
or similar random identifiers (better not call them sequences, there's
not much sequential about it). The huge advantage is this does not
require replicating any state between the nodes, so logical decoding can
simply ignore them and replicate just the generated values. I don't
think there's any argument about that. If I as building such distributed
system, I'd certainly use such random IDs.

The question is what to do about the other use cases - online upgrades
relying on logical decoding, failovers to logical replicas, and so on.
Or what to do about existing systems that can't be easily changed to use
different/random identifiers. Those are not really distributed systems
and therefore don't quite need random IDs.

Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can
easily lead to massive write amplification, for example. There are
variants like UUIDv7 reducing the impact, but there's other stuff.

My takeaway from this is there's still value in having this feature.

2) correctness

The only correctness issue I'm aware of is the question what happens
when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding
the relfilenode creation and the sequence increment, pointed out by
Dilip in [1].

If this happens (and while I don't have a reproducer, I also don't have
a very clear idea why it couldn't happen), it breaks how the patch
decides between transactional and non-transactional sequence changes.

So this seems like a fatal flaw - it definitely needs to be solved. I
don't have a good idea how to do that, unfortunately. The problem is the
dependency on an earlier record, and that this needs to be evaluated
immediately (in the decode phase). Logical messages don't have the same
issue because the "transactional" flag does not depend on earlier stuff,
and other records are not interpreted until apply/commit, when we know
everything relevant was decoded.

I don't know what the solution is. Either we find a way to make sure not
to lose/skip the smgr record, or we need to rethink how we determine the
transactional flag (perhaps even try again adding it to the WAL record,
but we didn't find a way to do that earlier).

3) performance issues

We have discussed two cases - "ddl-abort" and "nextval-40".

The "ddl-abort" is when the workload does a lot of DDL and then aborts
them, leading to profiles dominated by DecodeCommit. The agreement here
is that while this is a valid issue and we should try fixing it, it's
unrelated to this patch. The issue exists even on master. So in the
context of this patch we can ignore this issue.

The "nextval-40" applies to workloads doing a lot of regular sequence
changes. We only decode/apply changes written to WAL, and that happens
only for every 32 increments or so. The test was with a very simple
transaction (just sequence advanced to write WAL + 1-row insert), which
means it's pretty much a worst case impact. For larger transactions,
it's going to be hardly measurable. Also, this only measured decoding,
not apply (which also will make this less significant).

Most of the overhead comes from ReorderBufferQueueSequence() starting
and then aborting a transaction, per the profile in [2]. This only
happens in the non-transactional case, but we expect that in regular

Anyway, let's say we want to mitigate this overhead. I think there are
three ways to do that:

a) find a way to not have to apply sequence changes immediately, but
queue them until the next commit

This would give a chance to combine multiple sequence changes into a
single "replay change", reducing the overhead. There's a couple problems
with this, though. Firstly, it can't help OLTP workloads because the
transactions are short so sequence changes are unlikely to combine. It's
also, not clear how expensive this be - could it be expensive enough to
outweigh the benefits?

All of this is assuming it can be implemented, we don't have such patch
yet. I was speculating about something like this earlier, but I haven't
managed to make that work. Doesn't mean it's impossible, ofc.

b) provide a way for the output plugin to skip sequence decoding early

The way the decoding is coded now, ReorderBufferQueueSequence does all
the expensive dance even if the output plugin does not implement the
sequence callbacks.

Maybe we should have a way to allow skipping all of this early, right at
the beginning of ReorderBufferQueueSequence (and thus before we even try
to start/abort the transaction).

Ofc, this is not a perfect solution either - it won't help workloads
that actually need/want sequence decoding but the workload is such that
the decoding has significant overhead, or with plugins that choose to
support decoding sequences in genera. For example the built-in output
plugin would certainly support sequences - and the overhead would still
be there (even if no sequences are added to the publication).

b) instruct people to increase the sequence cache from 32 to 1024

This would reduce the number of WAL messages that need to be decoded and
replayed, reducing the overhead proportionally. Of course, this also
means the sequence will "jump forward" more in case of crash or failover
to the logical replica, but I think that's acceptable tradeoff. People
should not expect sequences to be gap-less anyway.

Considering nextval-40 is pretty much a worst-case behavior, I think
this might actually be an acceptable solution/workaround.

regards

[1]
https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/0bc34f71-7745-dc16-d765-5ba1f0776a3f%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2023-12-12 10:18:59 Re: Add --check option to pgindent
Previous Message Zhijie Hou (Fujitsu) 2023-12-12 09:47:49 RE: Synchronizing slots from primary to standby