Re: Unstable tests for recovery conflict handling

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Unstable tests for recovery conflict handling
Date: 2022-08-03 20:33:46
Message-ID: CA+TgmoaPNnKCd-01pCuJ6Jm2xEHReq3c-yyD=jS9SgMgj1WPOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> The reason nothing might get logged in some cases is that
> e.g. ResolveRecoveryConflictWithLock() tells
> ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:
> /*
> * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting
> * "waiting" in PS display by disabling its argument report_waiting
> * because the caller, WaitOnLock(), has already reported that.
> */
>
> so ResolveRecoveryConflictWithLock() can end up looping indefinitely without
> logging anything.

I understand why we need to avoid adding "waiting" to the PS status
when we've already done that, but it doesn't seem like that should
imply skipping ereport() of log messages.

I think we could redesign the way the ps display works to make things
a whole lot simpler. Let's have a function set_ps_display() and
another function set_ps_display_suffix(). What gets reported to the OS
is the concatenation of the two. Calling set_ps_display() implicitly
resets the suffix to empty.

AFAICS, that'd let us get rid of this tricky logic, and some other
tricky logic as well. Here, we'd just say set_ps_display_suffix("
waiting") and not worry about whether the caller might have already
done something similar.

> Another question I have about ResolveRecoveryConflictWithLock() is whether
> it's ok that we don't check deadlocks around the
> ResolveRecoveryConflictWithVirtualXIDs() call? It might be ok, because we'd
> only block if there's a recovery conflict, in which killing the process ought
> to succeed?

The startup process is supposed to always "win" in any deadlock
situation, so I'm not sure what you think is a problem here. We get
the conflicting lockers. We kill them. If they don't die, that's a
bug, but killing ourselves doesn't really help anything; if we die,
the whole system goes down, which seems undesirable.

> I think there's also might be a problem with the wait loop in ProcSleep() wrt
> recovery conflicts: We rely on interrupts to be processed to throw recovery
> conflict errors, but ProcSleep() is called in a bunch of places with
> interrupts held. An Assert(INTERRUPTS_CAN_BE_PROCESSED()) after releasing the
> partition lock triggers a bunch. It's possible that these aren't problematic
> cases for recovery conflicts, because they're all around extension locks:
> [...]
> Independent of recovery conflicts, isn't it dangerous that we acquire the
> relation extension lock with a buffer content lock held? I *guess* it might be
> ok because BufferAlloc(P_NEW) only acquires buffer content locks in a
> conditional way.

These things both seem a bit sketchy but it's not 100% clear to me
that anything is actually broken. Now it's also not clear to me that
nothing is broken ...

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2022-08-03 21:33:53 pgsql: Fix incorrect tests for SRFs in relation_can_be_sorted_early().
Previous Message Andres Freund 2022-08-03 17:57:02 Re: Unstable tests for recovery conflict handling

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-08-03 20:34:16 Re: pg15b2: large objects lost on upgrade
Previous Message Peter Geoghegan 2022-08-03 20:29:25 Re: pg15b2: large objects lost on upgrade