Re: logical decoding and replication of sequences, take 2

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Subject: Re: logical decoding and replication of sequences, take 2
Date: 2024-01-24 17:46:37
Message-ID: 929a38f6-b9f9-4cc4-926a-ac972e769e17@enterprisedb.com
Lists: pgsql-hackers

On 1/23/24 21:47, Robert Haas wrote:
> On Thu, Jan 11, 2024 at 11:27 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> 1) desirability: We want a built-in way to handle sequences in logical
>> replication. I think everyone agrees this is not a way to do distributed
>> sequences in active-active setups, but that there are other use cases
>> that need this feature - typically upgrades / logical failover.
>
> Yeah. I find it extremely hard to take seriously the idea that this
> isn't a valuable feature. How else are you supposed to do a logical
> failover without having your entire application break?
>
>> 2) performance: There was concern about the performance impact, and that
>> it affects everyone, including those who don't replicate sequences (as
>> the overhead is mostly incurred before calls to output plugin etc.).
>>
>> The agreement was that the best way is to have a CREATE SUBSCRIPTION
>> option that would instruct the upstream to decode sequences. By default
>> this option is 'off' (because that's the no-overhead case), but it can
>> be enabled for each subscription.
>
> Seems reasonable, at least unless and until we come up with something better.
>
>> 3) correctness: The last point is about making "transactional" flag
>> correct when the snapshot state changes mid-transaction, originally
>> pointed out by Dilip [4]. Per [5] this however happens to work
>> correctly, because while we identify the change as 'non-transactional'
>> (which is incorrect), we immediately throw it away (so we don't try to
>> apply it, which would error-out).
>
> I've said this before, but I still find this really scary. It's
> unclear to me that we can simply classify updates as transactional or
> non-transactional and expect things to work. If it's possible, I hope
> we have a really good explanation somewhere of how and why it's
> possible. If we do, can somebody point me to it so I can read it?
>

I did try to explain how this works (and why) in a couple places:

1) the commit message
2) reorderbuffer header comment
3) ReorderBufferSequenceIsTransactional comment (and nearby)

It's possible this does not meet your expectations, ofc. Maybe there
should be a separate README for this - I haven't found anything like
that for logical decoding in general, which is why I did (1)-(3).

> To be possibly slightly more clear about my concern, I think the scary
> case is where we have transactional and non-transactional things
> happening to the same sequence in close temporal proximity, either
> within the same session or across two or more sessions. If a
> non-transactional change can get reordered ahead of some transactional
> change upon which it logically depends, or behind some transactional
> change that logically depends on it, then we have trouble. I also
> wonder if there are any cases where the same operation is partly
> transactional and partly non-transactional.
>

I certainly understand this concern, and to some extent I even share it.
Having to differentiate between transactional and non-transactional
changes certainly confused me more than once. It's especially confusing,
because the decoding implicitly changes the perceived ordering/atomicity
of the events.

That being said, I don't think it can get reordered the way you're
concerned about. The "transactionality" is determined by a relfilenode
change, so how could the reordering happen? We'd have to misidentify a
change in either direction - and for a nontransactional->transactional
misidentification that's clearly not possible. There has to be a new
relfilenode in that xact.

In the other direction (transactional->nontransactional), it can happen
if we fail to decode the relfilenode record. That is the case we
discussed earlier, and we came to the conclusion that it actually works OK.
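
To make the rule above concrete, here is a toy model of it. This is only
a sketch and not the patch code: the class name, the method names and the
relfilenode/xid numbers are all made up for illustration. The one thing it
assumes from the description above is that a sequence change counts as
transactional exactly when it touches a relfilenode created by a
transaction the decoder still tracks as in progress.

```python
from dataclasses import dataclass, field


@dataclass
class ToySequenceTracker:
    """Hypothetical stand-in for the decoder's bookkeeping (not patch code)."""

    # relfilenode -> xid of the in-progress transaction that created it
    created_by: dict[int, int] = field(default_factory=dict)

    def relfilenode_created(self, xid: int, relfilenode: int) -> None:
        # Called when the decoder sees a record creating a new relfilenode.
        self.created_by[relfilenode] = xid

    def transaction_finished(self, xid: int) -> None:
        # Once the creating transaction ends, its relfilenodes stop being "new".
        self.created_by = {
            node: owner for node, owner in self.created_by.items() if owner != xid
        }

    def is_transactional(self, relfilenode: int) -> bool:
        # A change is transactional only if the relfilenode it touches was
        # created by a transaction the decoder still tracks as in progress.
        return relfilenode in self.created_by


tracker = ToySequenceTracker()

# nextval() on a pre-existing sequence (relfilenode 100): non-transactional.
plain_nextval = tracker.is_transactional(100)

# Transaction 7 gives the sequence a new relfilenode 101 (for example via
# ALTER SEQUENCE ... RESTART); later changes to 101 are transactional.
tracker.relfilenode_created(xid=7, relfilenode=101)
after_alter = tracker.is_transactional(101)

# A decoder that never saw the relfilenode record for 202 has no entry for
# it, so it reports non-transactional: the misidentification case.
missed_record = tracker.is_transactional(202)

print(plain_nextval, after_alter, missed_record)
```

In this model a change can only be labeled transactional if a relfilenode
record was seen first, which is why the nontransactional->transactional
mistake cannot occur. The opposite mistake shows up as the last lookup:
with no record seen, the change is labeled non-transactional. The sketch
does not model what happens next; per the earlier discussion, such changes
are thrown away rather than applied.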

Of course, there might be bugs. I spent quite a bit of effort reviewing
and testing this, but there still might be something wrong. But I think
that applies to any feature.

What would be worse is some sort of thinko in the approach in general. I
don't have a good answer to that, unfortunately - I think it works, but
how would I know for sure? We explored multiple alternative approaches
and all of them crashed and burned ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company