Re: Spurious standby query cancellations

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spurious standby query cancellations
Date: 2015-09-04 22:25:31
Message-ID: CANP8+jLAvBd02PkDiLLvAGyDgDuoxinuFjgPf4OSVrjNWH+Anw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 27 August 2015 at 22:55, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> In ResolveRecoveryConflictWithLock, there is the comment:
>
> /*
> * If blowing away everybody with conflicting locks doesn't work, after
> * the first two attempts then we just start blowing everybody away
> until
> * it does work.
>
>
> But what it does is something different than that.
>
> At least under some conditions, it will wait patiently for all initially
> conflicting locks to go away. If that doesn't work twice (because new
> lockers joined while we were waiting for the old ones to go away), then it
> will wait patiently for all transactions throughout the system to go away
> even if they don't conflict, perhaps not even realizing that the lock has
> become free in the mean time.
>
> Then when its patience runs out, it kills everything on the system. But
> it never tried to blow away just the conflicting locks, instead it just
> tried to wait them out.
>
> The fact that trying to wait them out didn't work (twice) doesn't mean
> that killing them wouldn't have worked.
>
> I think that it was intended that num_attempts would get incremented only
> once WaitExceedsMaxStandbyDelay becomes true, but that is not what happens
> with the current code.
>
> Currently WaitExceedsMaxStandbyDelay only has one caller. I think it we
> should take the sleep code out of that function and move it into the
> existing call site, and then have ResolveRecoveryConflictWithLock check
> with WaitExceedsMaxStandbyDelay before incrementing num_attempts.
>

I agree with the problem in the current code, thank you for spotting it.

I also agree that the proposed solution would work-ish, if we are
suggesting backpatching this.

It's now possible to fix this by putting a lock wait on the actual lock
request, which wasn't available when I first wrote that, hence the crappy
wait loop. Using the timeout handler would now be the preferred way to
solve this. We can backpatch that to 9.3 if needed, where they were
introduced.

There's an example of how to use lock waits further down
on ResolveRecoveryConflictWithBufferPin(). Could you have a look at doing
it that way?

We probably also need to resurrect my earlier patch to avoid logging
AccessExclusiveLocks on temp tables.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2015-09-04 22:31:32 Re: Summary of plans to avoid the annoyance of Freezing
Previous Message Tomas Vondra 2015-09-04 21:03:31 Re: PATCH: index-only scans with partial indexes