out-of-order XID insertion in KnownAssignedXids

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: out-of-order XID insertion in KnownAssignedXids
Date: 2018-10-05 07:06:45
Message-ID: 0c96b653-4696-d4b4-6b5d-78143175d113@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

Looks like there is a bug with logging running transactions XIDs and
prepared transactions.
One of our customers get error "FATAL: out-of-order XID insertion in
KnownAssignedXids"
trying to apply backup.
WAL contains the following record:

rmgr: Standby     len (rec/tot):     98/    98, tx: 0, lsn:
1418/A9A76C90, prev 1418/A9A76C48, desc: RUNNING_XACTS nextXid
2004495309 latestCompletedXid 2004495307 oldestRunningXid 2004495290; 3
xacts: 2004495290 2004495308 2004495308

As you can notice, XID 2004495308 is encountered twice which cause error
in KnownAssignedXidsAdd:

    if (head > tail &&
        TransactionIdFollowsOrEquals(KnownAssignedXids[head - 1],
from_xid))
    {
        KnownAssignedXidsDisplay(LOG);
        elog(ERROR, "out-of-order XID insertion in KnownAssignedXids");
    }

The probability of this error is very small but it can quite easily
reproduced: you should just set breakpoint in debugger after calling
MarkAsPrepared in twophase.c and then try to prepare any transaction.
MarkAsPrepared  will add GXACT to proc array and at this moment there
will be two entries in procarray with the same XID:

(gdb) p procArray->numProcs
$2 = 4
(gdb) p allPgXact[procArray->pgprocnos[0]]
$4 = {xid = 513976717, xmin = 0, vacuumFlags = 0 '\000', overflowed = 0
'\000', delayChkpt = 1 '\001', nxids = 0 '\000',
  used = 0 '\000', parent = 0x0}
(gdb) p allPgXact[procArray->pgprocnos[1]]
$5 = {xid = 0, xmin = 0, vacuumFlags = 0 '\000', overflowed = 0 '\000',
delayChkpt = 0 '\000', nxids = 0 '\000',
  used = 0 '\000', parent = 0x0}
(gdb) p allPgXact[procArray->pgprocnos[2]]
$6 = {xid = 0, xmin = 0, vacuumFlags = 0 '\000', overflowed = 0 '\000',
delayChkpt = 0 '\000', nxids = 0 '\000',
  used = 0 '\000', parent = 0x0}
(gdb) p allPgXact[procArray->pgprocnos[3]]
$7 = {xid = 513976717, xmin = 0, vacuumFlags = 0 '\000', overflowed = 0
'\000', delayChkpt = 0 '\000', nxids = 0 '\000',
  used = 0 '\000', parent = 0x0}

Then you should just wait for sometime until checkpoint timeout is
triggered and it  logs snapshot:

(gdb) bt
#0  0x00000000007f3dab in GetRunningTransactionData () at procarray.c:2240
#1  0x00000000007fab22 in LogStandbySnapshot () at standby.c:943
#2  0x000000000077cde8 in BackgroundWriterMain () at bgwriter.c:331
#3  0x00000000005377f3 in AuxiliaryProcessMain (argc=2,
argv=0x7ffe00aa00e0) at bootstrap.c:446
#4  0x000000000078e07e in StartChildProcess (type=BgWriterProcess) at
postmaster.c:5323
#5  0x000000000078b6f0 in reaper (postgres_signal_arg=17) at
postmaster.c:2948
#6  <signal handler called>
#7  0x00007f1356d665b3 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:84
#8  0x0000000000789931 in ServerLoop () at postmaster.c:1765
#9  0x000000000078906c in PostmasterMain (argc=3, argv=0x1902640) at
postmaster.c:1406
#10 0x00000000006d0e4f in main (argc=3, argv=0x1902640) at main.c:228

Now generated RUNNING_XACTS record contains duplicated XIDs.

I want to ask opinion of community about the best way of fixing this
problem.
Should we avoid storing duplicated XIDs in procarray (by invalidating
XID in original pgaxct)
or eliminate/change check for duplicate in KnownAssignedXidsAdd (for
example just ignore duplicates)?

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Anushi Maheshwari 2018-10-05 07:30:23 GCI Mentor
Previous Message Michael Paquier 2018-10-05 07:05:18 Re: pg_upgrade failed with ERROR: null relpartbound for relation 18159 error.