Re: Adding REPACK [concurrently]

From: Antonin Houska <ah(at)cybertec(dot)at>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Treat <rob(at)xzilla(dot)net>
Subject: Re: Adding REPACK [concurrently]
Date: 2026-05-11 15:17:52
Message-ID: 70574.1778512672@localhost
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Tue, May 5, 2026 at 6:17 PM Antonin Houska <ah(at)cybertec(dot)at> wrote:
> >
> > Antonin Houska <ah(at)cybertec(dot)at> wrote:
> >
> > I think the problem is that with database-specific snapshot,
> > SnapBuildProcessRunningXacts() returns early, w/o adjusting builder->xmin
> >
> > /*
> > * Database specific transaction info may exist to reach CONSISTENT state
> > * faster, however the code below makes no use of it. Moreover, such
> > * record might cause problems because the following normal (cluster-wide)
> > * record can have lower value of oldestRunningXid. In that case, let's
> > * wait with the cleanup for the next regular cluster-wide record.
> > */
> > if (OidIsValid(running->dbid))
> > return;
> >
> > and thus some transactions whose XID is below running->oldestRunningXid may
> > continue to be incorrectly considered running.
> >
> > I originally thought that this should not happen because such transactions
> > will be added to the builder's array of committed transactions by
> > SnapBuildCommitTxn() anyway. However, I failed to notice that COMMIT record of
> > a transaction listed in the xl_running_xacts WAL record is not guaranteed to
> > follow the xl_running_xacts record in WAL. In other words, even if
> > xl_running_xacts is created before a COMMIT record of the contained
> > transaction, it may end up at higher LSN in WAL. So the cleanup I relied on
> > might not take place.
> >
>
> BTW, is it possible to write a test by using injection_points or via
> manual steps (by using debugger, etc) so that we can more clearly
> understand this problem and proposed fix?

So far I could observe the situation in WAL, but have no idea how it can
happen. For example, transaction 49242 gets committed here

rmgr: Transaction len (rec/tot): 46/ 46, tx: 49242, lsn: 0/18BC28C8, prev
0/18BC2890, desc: COMMIT 2026-05-11 16:38:16.603265 CEST

and then it appears in the 'xids' list of RUNNING_XACTS:

rmgr: Standby len (rec/tot): 106/ 106, tx: 0, lsn:
0/18BC3140, prev 0/18BC3100, desc: RUNNING_XACTS nextXid 49255
latestCompletedXid 49241 oldestRunningXid 49242; 13 xacts: 49248 49249 49246
49243 49252 49251 49244 49245 49242 49250 49253 49254 49247; dbid:5

I thought the situation is quite common (and therefore nothing of
SnapBuildProcessRunningXacts() should be skipped), but when trying to
reproduce the problem, I noticed that LogStandbySnapshot() shouldn't allow
that ordering issue when logical decoding is enabled:

/*
* GetRunningTransactionData() acquired ProcArrayLock, we must release it.
* For Hot Standby this can be done before inserting the WAL record
* because ProcArrayApplyRecoveryInfo() rechecks the commit status using
* the clog. For logical decoding, though, the lock can't be released
* early because the clog might be "in the future" from the POV of the
* historic snapshot. This would allow for situations where we're waiting
* for the end of a transaction listed in the xl_running_xacts record
* which, according to the WAL, has committed before the xl_running_xacts
* record. Fortunately this routine isn't executed frequently, and it's
* only a shared lock.
*/
if (!logical_decoding_enabled)
LWLockRelease(ProcArrayLock);

So I don't have the answer right now.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2026-05-11 15:19:44 Re: Missing jsonb_ ... functions on DOCs
Previous Message Marcos Pegoraro 2026-05-11 15:07:33 Re: Missing jsonb_ ... functions on DOCs