Re: Issues with logical replication

From: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: Issues with logical replication
Date: 2017-10-10 01:19:17
Message-ID: 1B3E32D8-FCF4-40B4-AEF9-5C0E3AC57969@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers


> On 2 Oct 2017, at 19:59, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>
>>
>> Program terminated with signal SIGABRT, Aborted.
>> #0 0x00007f3608f8ec37 in __GI_raise (sig=sig(at)entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>> (gdb) bt
>> #0 0x00007f3608f8ec37 in __GI_raise (sig=sig(at)entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> #1 0x00007f3608f92028 in __GI_abort () at abort.c:89
>> #2 0x00000000009f5740 in ExceptionalCondition (conditionName=0xbf6b30
>> "!(((xid) != ((TransactionId) 0)))",
>> errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c",
>> lineNumber=582) at assert.c:54
>> #3 0x000000000086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0,
>> oper=XLTW_None) at lmgr.c:582
>> #4 0x000000000081c9f7 in SnapBuildWaitSnapshot (running=0x2749198,
>> cutoff=898498) at snapbuild.c:1400
>> #5 0x000000000081c7c7 in SnapBuildFindSnapshot (builder=0x2807910,
>> lsn=798477760, running=0x2749198) at snapbuild.c:1311
>> #6 0x000000000081c339 in SnapBuildProcessRunningXacts
>> (builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106
>
>
> Hmm this would suggest that XLOG_RUNNING_XACTS record contains invalid
> transaction ids but we specifically skip those in
> GetRunningTransactionData(). Can you check pg_waldump output for that
> record (lsn is shown in the backtrace)?

I investigated this case and it seems that XactLockTableWait() in SnapBuildWaitSnapshot()
not always work as expected. XactLockTableWait() waits on LockAcquire() for xid to be
completed and if we finally got this lock but transaction is still in progress then such xid
assumed to be a subxid. However LockAcquire/LockRelease cycle can happen after transaction
set xid, but before XactLockTableInsert().

Namely following history happened for xid 5225 and lead to crash:

[backend] LOG: AssignTransactionId: XactTopTransactionId = 5225
[walsender] LOG: LogCurrentRunningXacts: Wrote RUNNING_XACTS xctn=1, xid[0]=5225
[walsender] LOG: XactLockTableWait: LockAcquire 5225
[walsender] LOG: XactLockTableWait: LockRelease 5225
[backend] LOG: AssignTransactionId: LockAcquire ExclusiveLock 5225
[walsender] LOG: TransactionIdIsInProgress: SVC->latestCompletedXid=5224 < xid=5225 => true
[backend] LOG: CommitTransaction: ProcArrayEndTransaction xid=5225, ipw=0
[backend] LOG: CommitTransaction: ResourceOwnerRelease locks xid=5225

I’ve quickly checked other usages of XactLockTableWait() and it seems that it is used mostly with
xids from heap so tx definetly set it lock somewhere in the past.

Not sure what it the best approach to handle that. May be write running xacts only if they already
set their lock?

Also attaching pgbench script that can reproduce failure. I run it over usual pgbench database
in 10 threads. It takes several minutes to crash.

Attachment Content-Type Size
reload.pgb application/octet-stream 583 bytes
unknown_filename text/plain 95 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2017-10-10 01:31:01 Re: On markers of changed data
Previous Message Michael Paquier 2017-10-10 01:00:10 Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple