Re: Unstable tests for recovery conflict handling

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Unstable tests for recovery conflict handling
Date: 2022-08-03 22:07:56
Message-ID: 20220803220756.be7lljtuakaenjzu@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

Hi,

On 2022-08-03 16:33:46 -0400, Robert Haas wrote:
> On Wed, Aug 3, 2022 at 1:57 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > The reason nothing might get logged in some cases is that
> > e.g. ResolveRecoveryConflictWithLock() tells
> > ResolveRecoveryConflictWithVirtualXIDs() to *not* report the waiting:
> > /*
> > * Prevent ResolveRecoveryConflictWithVirtualXIDs() from reporting
> > * "waiting" in PS display by disabling its argument report_waiting
> > * because the caller, WaitOnLock(), has already reported that.
> > */
> >
> > so ResolveRecoveryConflictWithLock() can end up looping indefinitely without
> > logging anything.
>
> I understand why we need to avoid adding "waiting" to the PS status
> when we've already done that, but it doesn't seem like that should
> imply skipping ereport() of log messages.
>
> I think we could redesign the way the ps display works to make things
> a whole lot simpler. Let's have a function set_ps_display() and
> another function set_ps_display_suffix(). What gets reported to the OS
> is the concatenation of the two. Calling set_ps_display() implicitly
> resets the suffix to empty.
>
> AFAICS, that'd let us get rid of this tricky logic, and some other
> tricky logic as well. Here, we'd just say set_ps_display_suffix("
> waiting") and not worry about whether the caller might have already
> done something similar.

That sounds like it'd be an improvement. Of course we still need to fix that
we can signal at a rate not allowing the other side to handle the conflict,
but at least that'd be easier to identify...

> > Another question I have about ResolveRecoveryConflictWithLock() is whether
> > it's ok that we don't check deadlocks around the
> > ResolveRecoveryConflictWithVirtualXIDs() call? It might be ok, because we'd
> > only block if there's a recovery conflict, in which killing the process ought
> > to succeed?
>
> The startup process is supposed to always "win" in any deadlock
> situation, so I'm not sure what you think is a problem here. We get
> the conflicting lockers. We kill them. If they don't die, that's a
> bug, but killing ourselves doesn't really help anything; if we die,
> the whole system goes down, which seems undesirable.

The way deadlock timeout for the startup process works is that we wait for it
to pass and then send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK to the
backends. So it's not that the startup process would die.

The question is basically whether there are cases were
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK would resolve a conflict but
PROCSIG_RECOVERY_CONFLICT_LOCK wouldn't. It seems plausible that there isn't,
but it's also not obvious enough that I'd fully trust it.

Greetings,

Andres Freund

In response to

Browse pgsql-committers by date

  From Date Subject
Next Message Masahiko Sawada 2022-08-04 01:23:37 Re: pgsql: Add wait_for_subscription_sync for TAP tests.
Previous Message Tom Lane 2022-08-03 21:33:53 pgsql: Fix incorrect tests for SRFs in relation_can_be_sorted_early().

Browse pgsql-hackers by date

  From Date Subject
Next Message Matthias van de Meent 2022-08-03 22:35:13 Re: postgres_fdw: batch inserts vs. before row triggers
Previous Message Tom Lane 2022-08-03 21:57:57 Re: postgres_fdw: batch inserts vs. before row triggers