Re: logical decoding and replication of sequences, take 2

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: logical decoding and replication of sequences, take 2
Date: 2022-11-17 21:13:23
Message-ID: 84d60912-6eab-9b84-5de3-41765a5449e8@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/17/22 18:07, Andres Freund wrote:
> Hi,
>
> On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
>> On 11/17/22 03:43, Andres Freund wrote:
>>> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
>>>> Well, yeah - we can either try to perform the stuff independently of the
>>>> transactions that triggered it, or we can try making it part of some of
>>>> the transactions. Each of those options has problems, though :-(
>>>>
>>>> The first version of the patch tried the first approach, i.e. decode the
>>>> increments and apply that independently. But:
>>>>
>>>> (a) What would you do with increments of sequences created/reset in a
>>>> transaction? Can't apply those outside the transaction, because it
>>>> might be rolled back (and that state is not visible on primary).
>>>
>>> I think a reasonable approach could be to actually perform different WAL
>>> logging for that case. It'll require a bit of machinery, but could actually
>>> result in *less* WAL logging overall, because we don't need to emit a WAL
>>> record for each SEQ_LOG_VALS sequence values.
>>>
>>
>> Could you elaborate? Hard to comment without knowing more ...
>>
>> My point was that stuff like this (creating a new sequence or at least a
>> new relfilenode) means we can't apply that independently of the
>> transaction (unlike regular increments). I'm not sure how a change to
>> WAL logging would make that go away.
>
> Different WAL logging would make it easy to handle that on the logical
> decoding level. We don't need to emit WAL records each time a
> created-in-this-toplevel-xact sequences gets incremented as they're not
> persisting anyway if the surrounding xact aborts. We already need to remember
> the filenode so it can be dropped at the end of the transaction, so we could
> emit a single record for each sequence at that point.
>
>
>>>> (b) What about increments created before we have a proper snapshot?
>>>> There may be transactions dependent on the increment. This is what
>>>> ultimately led to revert of the patch.
>>>
>>> I don't understand this - why would we ever need to process those increments
>>> from before we have a snapshot? Wouldn't they, by definition, be before the
>>> slot was active?
>>>
>>> To me this is the rough equivalent of logical decoding not giving the initial
>>> state of all tables. You need some process outside of logical decoding to get
>>> that (obviously we have some support for that via the exported data snapshot
>>> during slot creation).
>>>
>>
>> Which is what already happens during tablesync, no? We more or less copy
>> sequences as if they were tables.
>
> I think you might have to copy sequences after tables, but I'm not sure. But
> otherwise, yea.
>
>
>>> I assume that part of the initial sync would have to be a new sequence
>>> synchronization step that reads all the sequence states on the publisher and
>>> ensures that the subscriber sequences are at the same point. There's a bit of
>>> trickiness there, but it seems entirely doable. The logical replication replay
>>> support for sequences will have to be a bit careful about not decreasing the
>>> subscriber's sequence values - the standby initially will be ahead of the
>>> increments we'll see in the WAL. But that seems inevitable given the
>>> non-transactional nature of sequences.
>>>
>>
>> See fetch_sequence_data / copy_sequence in the patch. The bit about
>> ensuring the sequence does not go away (say, using page LSN and/or LSN
>> of the increment) is not there, however isn't that pretty much what I
>> proposed doing for "reconciling" the sequence state logged at COMMIT?
>
> Well, I think the approach of logging all sequence increments at commit is the
> wrong idea...
>

But we're not logging all sequence increments, no?

We're logging the state for each sequence touched by the transaction,
but only once - if the transaction incremented the sequence 1000000x
times, we'll still log it just once (at least for this particular purpose).

Yes, if transactions touch each sequence just once, then we're logging
individual increments.

The only more efficient solution would be to decode the existing WAL
(every ~32 increments), and perhaps also tracking which sequences were
accessed by a transaction. And then simply stashing the increments in a
global reorderbuffer hash table, and then applying only the last one at
commit time. This would require the transactional / non-transactional
behavior (I think), but perhaps we can make that work.

Or are you thinking about some other scheme?

> Creating a new relfilenode whenever a sequence is incremented seems like a
> complete no-go to me. That increases sequence overhead by several orders of
> magnitude and will lead to *awful* catalog bloat on the subscriber.
>

You mean on the the apply side? Yes, I agree this needs a better
approach, I've focused on the decoding side so far.

>
>>>
>>>> This version of the patch tries to do the opposite thing - make sure
>>>> that the state after each commit matches what the transaction might have
>>>> seen (for sequences it accessed). It's imperfect, because it might log a
>>>> state generated "after" the sequence got accessed - it focuses on the
>>>> guarantee not to generate duplicate values.
>>>
>>> That approach seems quite wrong to me.
>>>
>>
>> Why? Because it might log a state for sequence as of COMMIT, when the
>> transaction accessed the sequence much earlier?
>
> Mainly because sequences aren't transactional and trying to make them will
> require awful contortions.
>
> While there are cases where we don't flush the WAL / wait for syncrep for
> sequences, we do replicate their state correctly on physical replication. If
> an LSN has been acknowledged as having been replicated, we won't just loose a
> prior sequence increment after promotion, even if the transaction didn't [yet]
> commit.
>

True, I agree we should aim to achieve that.

> It's completely valid for an application to call nextval() in one transaction,
> potentially even abort it, and then only use that sequence value in another
> transaction.
>

I don't quite agree with that - we make no promises about what happens
to sequence changes in aborted transactions. I don't think I've ever
seen an application using such pattern either.

And I'd argue we already fail to uphold such guarantee, because we don't
wait for syncrep if the sequence WAL happened in aborted transaction. So
if you use the value elsewhere (outside PG), you may lose it.

Anyway, I think the scheme I outlined above (with stashing decoded
increments, logged once every 32 values) and applying the latest
increment for all sequences at commit, would work.

>
>
>>> I did some skimming of the referenced thread about the reversal of the last
>>> approach, but I couldn't really understand what the fundamental issues were
>>> with the reverted implementation - it's a very long thread and references
>>> other threads.
>>>
>>
>> Yes, it's long/complex, but I intentionally linked to a specific message
>> which describes the issue ...
>>
>> It's entirely possible there is a simple fix for the issue, and I just
>> got confused / unable to see the solution. The whole issue was due to
>> having a mix of transactional and non-transactional cases, similarly to
>> logical messages - and logicalmsg_decode() has the same issue, so maybe
>> let's talk about that for a moment.
>>
>> See [3] and imagine you're dealing with a transactional message, but
>> you're still building a consistent snapshot. So the first branch applies:
>>
>> if (transactional &&
>> !SnapBuildProcessChange(builder, xid, buf->origptr))
>> return;
>>
>> but because we don't have a snapshot, SnapBuildProcessChange does this:
>>
>> if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
>> return false;
>
> In this case we'd just return without further work in logicalmsg_decode(). The
> problematic case presumably is is when we have a full snapshot but aren't yet
> consistent, but xid is >= next_phase_at. Then SnapBuildProcessChange() returns
> true. And we reach:
>
>> which however means logicalmsg_decode() does
>>
>> snapshot = SnapBuildGetOrBuildSnapshot(builder);
>>
>> which crashes, because it hits this assert:
>>
>> Assert(builder->state == SNAPBUILD_CONSISTENT);
>
> I think the problem here is just that we shouldn't even try to get a snapshot
> in the transactional case - note that it's not even used in
> ReorderBufferQueueMessage() for transactional message. The transactional case
> needs to behave like a "normal" change - we might never decode the message if
> the transaction ends up committing before we've reached a consistent point.
>
>
>> The sequence decoding did almost the same thing, with the same issue.
>> Maybe the correct thing to do is to just ignore the change in this case?
>
> No, I don't think that'd be correct, the message | sequence needs to be queued
> for the transaction. If the transaction ends up committing after we've reached
> consistency, we'll get the correct snapshot from the base snapshot set in
> SnapBuildProcessChange().
>

Yeah, I think you're right. I looked at this again, with fresh mind, and
I came to the same conclusion. Roughly what the attached patch does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
logical-msg-fix.patch text/x-patch 2.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2022-11-17 21:16:19 Outdated url to Hunspell
Previous Message Thomas Munro 2022-11-17 21:13:09 Re: ubsan fails on 32bit builds