Re: [HACKERS] Issues with logical replication

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] Issues with logical replication
Date: 2017-11-16 19:41:35
Message-ID: 20171116194135.f7a262nvz5wfhces@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017-11-16 10:36:40 -0500, Robert Haas wrote:
> On Wed, Nov 15, 2017 at 8:20 PM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
> > I did a sketch of first approach just to confirm that it solves the problem.
> > But there I hold ProcArrayLock during update of flag. Since only reader is
> > GetRunningTransactionData it possible to have a custom lock there. In
> > this case GetRunningTransactionData will hold three locks simultaneously,
> > since it already holds ProcArrayLock and XidGenLock =)
>
> To me, it seems like SnapBuildWaitSnapshot() is fundamentally
> misdesigned

Maybe I'm confused, but why is it fundamentally misdesigned? It's not
such an absurd idea to wait for an xid in a WAL record. I get that
there's a race condition here, which obviously bad, but I don't really
see as evidence of the above claim.

I actually think this code used to be safe because ProcArrayLock used to
be held while generating and logging the running snapshots record. That
was removed when fixing some other bug, but perhaps that shouldn't have
been done...

> and ideally Petr (who wrote the patch) or Andres (who committed it)
> ought to get involved here and help fix this problem. My own first
> inclination would be to rewrite this as a loop: if the transaction ID
> precedes the oldest running XID, then continue; else if
> TransactionIdDidCommit() || TransactionIdDidAbort() then conclude that
> we don't need to wait; else XactLockTableWait() then loop. That way,
> if you hit the race condition, you'll just busy-wait instead of doing
> the wrong thing. Maybe insert a sleep(1) if we retry more than once.
> That sucks, of course, but it seems like a better idea than trying to
> redesign XactLockTableWait() or the procarray, which could affect an
> awful lot of other things.

Hm. Thinking about this.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-11-16 19:44:52 Re: Inlining functions with "expensive" parameters
Previous Message Robert Haas 2017-11-16 19:18:02 Re: Typo in comment