Quick Links

Re: [HACKERS] Issues with logical replication

From:	Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To:	Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject:	Re: [HACKERS] Issues with logical replication
Date:	2017-11-29 23:25:58
Message-ID:	7de48098-e1a9-f889-c9ed-6b3981267093@2ndquadrant.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 29/11/17 20:11, Stas Kelvich wrote:
>
>> On 29 Nov 2017, at 18:46, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>>
>> What I don't understand is how it leads to crash (and I could not
>> reproduce it using the pgbench file attached in this thread either) and
>> moreover how it leads to 0 xid being logged. The only explanation I can
>> come up is that some kind of similar race has to be in
>> LogStandbySnapshot() but we explicitly check for 0 xid value there.
>>
>
> Zero xid isn’t logged. Loop in XactLockTableWait() does following:
>
> for (;;)
> {
> Assert(TransactionIdIsValid(xid));
> Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny()));
>
> <...>
>
> xid = SubTransGetParent(xid);
> }
>
> So if last statement is reached for top transaction then next iteration
> will crash in first assert. And it will be reached if whole this loop
> happens before transaction acquired heavyweight lock.
>
> Probability of that crash can be significantly increased be adding sleep
> between xid generation and lock insertion in AssignTransactionId().
>

Yes that helps thanks. Now that I reproduced it I understand, I was
confused by the backtrace that said xid was 0 on the input to
XactLockTableWait() but that's not the case, it's what xid is changed to
in the inner loop.

So what happens is that we manage to do LogStandbySnapshot(), decode the
logged snapshot, and run SnapBuildWaitSnapshot() for a transaction in
between GetNewTransactionId() and XactLockTableInsert() calls in
AssignTransactionId() for that same transaction.

I guess the probability of this happening is increased by the fact that
GetRunningTransactionData() acquires XidGenLock so if there is
GetNewTransactionId() running in parallel it will wait for it to finish
and we'll log immediately after that.

Hmm that means that Robert's loop idea will not help and ProcArrayLock
will not save us either. Maybe we could either rewrite XactLockTableWait
or create another version of it with SubTransGetParent() call replaced
by SubTransGetTopmostTransaction() as that will return the same top
level xid in case the input xid wasn't a subxact. That would make it
safe to be called on transactions that didn't acquire lock on themselves
yet.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Re: [HACKERS] Issues with logical replication at 2017-11-29 19:11:25 from Stas Kelvich

Responses

Re: [HACKERS] Issues with logical replication at 2017-11-29 23:40:36 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2017-11-29 23:30:37	Re: [HACKERS] Custom compression methods
Previous Message	Chapman Flack	2017-11-29 23:23:40	Re: Would a BGW need shmem_access or database_connection to enumerate databases?