Re: [HACKERS] Issues with logical replication

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Issues with logical replication
Date: 2017-11-29 15:46:12
Message-ID: 68879b3a-46c0-0aab-f6a2-d374010b4e71@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

(sorry for not being active here, I am still catching up after being
away for some family issues)

On 16/11/17 21:12, Robert Haas wrote:
> On Thu, Nov 16, 2017 at 2:41 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> To me, it seems like SnapBuildWaitSnapshot() is fundamentally
>>> misdesigned
>>
>> Maybe I'm confused, but why is it fundamentally misdesigned? It's not
>> such an absurd idea to wait for an xid in a WAL record. I get that
>> there's a race condition here, which obviously bad, but I don't really
>> see as evidence of the above claim.
>>
>> I actually think this code used to be safe because ProcArrayLock used to
>> be held while generating and logging the running snapshots record. That
>> was removed when fixing some other bug, but perhaps that shouldn't have
>> been done...
>
> OK. Well, I might be overstating the case. My comment about
> fundamental misdesign was really just based on the assumption that
> XactLockTableWait() could be used to wait for an XID the instant it
> was generated. That was never gonna work and there's no obvious clean
> workaround for the problem. Getting snapshot building to work
> properly seems to be Hard (TM).
>

No kidding, we've been at it since 9.4.

But anyway, while sure we have race condition because
XactLockTableWait() finishes too early, all that should mean is we call
LogStandbySnapshot() too early and as a result taking snapshot (and
hence slot creation) will take longer as we'll wait for next call of
LogStandbySnapshot() from some other caller (because the tx we care
about will still be running). The whole SnapBuildWaitSnapshot() is just
optimization to make slot creation take less time (and also to be able
to write tests).

What I don't understand is how it leads to crash (and I could not
reproduce it using the pgbench file attached in this thread either) and
moreover how it leads to 0 xid being logged. The only explanation I can
come up is that some kind of similar race has to be in
LogStandbySnapshot() but we explicitly check for 0 xid value there.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-11-29 15:47:21 Re: using index or check in ALTER TABLE SET NOT NULL
Previous Message Stephen Frost 2017-11-29 15:21:35 Re: using index or check in ALTER TABLE SET NOT NULL