Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Date: 2023-06-07 05:18:50
Message-ID: CAA4eK1Kro0rej=ZXMhcdjs+aYsZvNywu3-cqdRUtyAp4zqpVWw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 7, 2023 at 6:18 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 6/6/23 17:42, Tomas Vondra wrote:
> >
>
> In investigated this a bit more, and the problem actually seems to be
> more like this:
>
> 1) we create a new logical replication slot
>
> 2) while building the initial snapshot, we start with current insert
> location, and then process records
>
> 3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
> SnapBuildFindSnapshot
>
> 4) SnapBuildFindSnapshot does this:
>
> else if (!builder->building_full_snapshot &&
> SnapBuildRestore(builder, lsn))
> {
> /* there won't be any state to cleanup */
> return false;
> }
>
> 5) because create_logical_replication_slot and get_changes both call
> CreateInitDecodingContext with needs_full_snapshot=false, we end up
> calling SnapBuildRestore()
>
> 6) once in a while this likely hits a snapshot created by a concurrent
> session (for another logical slot) with SNAPBUILD_CONSISTENT state
>

I think this analysis doesn't seem to match what you mentioned in the
previous email which is as follows:
> > However, we skip the assignment, because the log for this call of
> > get_changes says this:
> >
> > LOG: logical decoding found consistent point at 0/6462E5D8
> >
> > so we fail to realize the 26662 is a subxact.

This is because the above LOG is printed when
"running->oldestRunningXid == running->nextXid" not when we restore
the snapshot. Am, I missing something?

>
> I don't know what's the correct fix for this. Maybe we should set
> needs_full_snapshot=true in create_logical_replication_slot when
> creating the initial snapshot? Maybe we should use true even in
> pg_logical_slot_get_changes_guts? This seems to fix the crashes ...
>

I don't think that is advisable because setting "needs_full_snapshot"
to true for decoding means the snapshot will start tracking
non-catalog committed xacts as well which is costly and is not
required for this case.

> That'll prevent reading the serialized snapshots like this, but how
> could that ever work? It seems pretty much guaranteed to ignore any
> assignments that happened right before the snapshot?
>

This part needs some analysis/thoughts. BTW, do you mean that it skips
the assignment (a) because the assignment record is before we reach a
consistent point, or (b) because we start reading WAL after the
assignment, or (c) something else? If you intend to say (a) then can
you please point me to the code you are referring to for the same?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jaime Casanova 2023-06-07 05:32:22 is pg_log_standby_snapshot() really needed?
Previous Message Dilip Kumar 2023-06-07 04:12:48 Re: Let's make PostgreSQL multi-threaded