Re: hot_standby_feedback vs excludeVacuum and snapshots

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hot_standby_feedback vs excludeVacuum and snapshots
Date: 2018-06-08 08:23:02
Message-ID: CANP8+jK__bi2AtU=FkBn78Xo_phd_Mkgyf1q9xm0iMEWS7bNSA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7 June 2018 at 22:25, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2018-06-07 14:19:18 -0700, Andres Freund wrote:

> Look at:
>
> void
> ProcArrayApplyRecoveryInfo(RunningTransactions running)
> ...
> /*
> * Remove stale locks, if any.
> *
> * Locks are always assigned to the toplevel xid so we don't need to care
> * about subxcnt/subxids (and by extension not about ->suboverflowed).
> */
> StandbyReleaseOldLocks(running->xcnt, running->xids);
>
> by excluding running transactions you have, as far as I can tell,
> effectively removed the vacuum truncation AEL from the standby.

I agree, that is correct, there is a bug in my recent commit that
causes a small race window that could potentially lead to someone
reading the size of a relation just before it is truncated and then
falling off the end of the scan, resulting in a block access ERROR,
potentially much later than the truncate.

I have also found another bug which affects what we do next.

For context, AEL locks are normally removed by COMMIT or ABORT.
StandbyReleaseOldLocks() is just a sweeper to catch anything that
didn't send an abort before it died, so it hardly ever activates. The
coding of StandbyReleaseOldLocks() is backwards... if it ain't in the
running list, then we remove it.

But that wasn't working correctly either, since as of 49bff5300d527 we
assigned AELs to subxids. Subxids weren't in the running list and so
AELs held by them would have been removed at the wrong time, an extant
bug in PG10. It looks to me like they would have been removed either
early or late, up to the next runningxact info record. They would be
removed, so no leakage, but the late timing wouldn't be noticeable for
tests or most usage, since it would look just like lock contention.
Early release might give same issue of block access to non-existent
block/file.

So the attached patch fixes both the bug in the recent commit and the
one I just found by observation of 49bff5300d527, since they are
related.

StandbyReleaseOldLocks() can sweep in the same way as
ExpireOldKnownAssignedTransactionIds().

> I also don't understand why this change would be backpatched in the
> first place. It's a relatively minor efficiency thing, no?

As for everything, that is open to discussion. Yes, it seems minor to
me.... until it affects you, then its not. It seems to have affected
Greg.

The attached patch, or a later revision, needs to be backpatched to
PG10 independently of the recent committed patch.

I have yet to test this manually, but will do so tomorrow morning.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
remove_standby_subxid_locks.v1.patch application/octet-stream 3.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Deepak Goel 2018-06-08 08:38:20 Re: High CPU load caused by the autovacuum launcher process
Previous Message Ron 2018-06-08 07:28:25 Re: High CPU load caused by the autovacuum launcher process