Re: Assertion failure in WaitForWALToBecomeAvailable state machine

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date: 2022-09-12 12:28:08
Message-ID: CALj2ACXk+GAH_=emK32eJLXP+yF0Vyaiiy44wTpsUw-q7p+9qQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 12, 2022 at 12:30 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Sat, Sep 10, 2022 at 07:52:01AM +0530, Bharath Rupireddy wrote:
> > Today, I spent some more time on this issue, I modified the v1 patch
> > posted upthread a bit - now resetting the InstallXLogFileSegmentActive
> > only when the WAL source switched to archive, not every time in
> > archive mode.
> >
> > I'm attaching v2 patch here with, please review it further.
> >
> > Just for the records - there's another report of the assertion failure
> > at [1], many thanks to Kyotaro-san for providing consistent
> > reproducible steps.
> >
> > [1] - https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com
>
> Well, the fact that cc2c7d6 is involved here makes this thread an open
> item for PG15 as far as I can see, assigned to Noah (added now in
> CC).

Thanks.

> While reading your last patch, I have found rather confusing that we
> only reset InstallXLogFileSegmentActive when the current source is the
> archives and it does not match the old source. This code is already
> complicated, and I don't think that having more assumptions in its
> internals is a good thing when it comes to its long-term maintenance.
> In short, HEAD is rather conservative when it comes to set
> InstallXLogFileSegmentActive, switching it only when we request
> streaming with RequestXLogStreaming(), but too aggressive when it
> comes to reset it and we want something in the middle ground. FWIW, I
> find better the approach taken by Horiguchi-san in [1] to reset the
> state before we attempt to read WAL from the archives *or* pg_wal,
> after we know that the last source has failed, where we know that we
> are not streaming yet (but recovery may be requested soon).
>
> [1]: https://www.postgresql.org/message-id/20220214.171428.735280610520122187.horikyota.ntt@gmail.com

I think the fix at [1] is wrong. It basically resets
InstallXLogFileSegmentActive for both XLOG_FROM_ARCHIVE and
XLOG_FROM_PG_WAL, remember that we don't need WAL files preallocation
and recycling just for archive. Moreover, if we were to reset just for
archive there, it's too aggressive, meaning for every WAL record, we
take ControlFileLock in exclusive mode to reset it.

IMO, doing it once when we switch the source to archive is the correct
way because it avoids frequent ControlFileLock acquisitions and makes
the fix more intuitive. And we have a comment saying why we reset
InstallXLogFileSegmentActive, if required we can point to the commit
cc2c7d6 in the comments.

> Side note: I don't see much the point of having two routines named
> SetInstallXLogFileSegmentActive and ResetInstallXLogFileSegmentActive
> that do the opposite thing. We could just have one.

It's an implementation choice. However, I changed it in the v3 patch.

Please review the attached v3 patch further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v3-0001-Avoid-race-condition-in-resetting-XLogCtl-Install.patch application/octet-stream 7.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2022-09-12 12:32:55 Re: pg_stat_statements locking
Previous Message Bharath Rupireddy 2022-09-12 11:39:17 Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)