RE: Assertion failure in SnapBuildInitialSnapshot()

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: RE: Assertion failure in SnapBuildInitialSnapshot()
Date: 2023-01-28 14:54:22
Message-ID: TYAPR01MB586690B7DE22E362A4A65F8CF5CD9@TYAPR01MB5866.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Amit, Sawada-san,

I have also reproduced the failure on PG15 with some debug log, and I agreed that
somebody changed procArray->replication_slot_xmin to InvalidTransactionId.

> > The same assertion failure has been reported on another thread[1].
> > Since I could reproduce this issue several times in my environment
> > I've investigated the root cause.
> >
> > I think there is a race condition of updating
> > procArray->replication_slot_xmin by CreateInitDecodingContext() and
> > LogicalConfirmReceivedLocation().
> >
> > What I observed in the test was that a walsender process called:
> > SnapBuildProcessRunningXacts()
> > LogicalIncreaseXminForSlot()
> > LogicalConfirmReceivedLocation()
> > ReplicationSlotsComputeRequiredXmin(false).
> >
> > In ReplicationSlotsComputeRequiredXmin() it acquired the
> > ReplicationSlotControlLock and got 0 as the minimum xmin since there
> > was no wal sender having effective_xmin.
> >
>
> What about the current walsender process which is processing
> running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
> slot's effective_xmin have a non-zero value? If not, then why?

Normal walsenders which are not for tablesync create a replication slot with
NOEXPORT_SNAPSHOT option. I think in this case, CreateInitDecodingContext() is
called with need_full_snapshot = false, and slot->effective_xmin is not updated.
It is set as InvalidTransactionId at ReplicationSlotCreate() and no functions update
that. Hence the slot acquired by the walsender may have Invalid effective_min.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2023-01-28 22:13:47 [PATCH] Fix old thinko in formula to compute sweight in numeric_sqrt().
Previous Message Tomas Vondra 2023-01-28 12:05:20 Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)