Re: BF mamba failure

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kouber Saparev <kouber(at)gmail(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BF mamba failure
Date: 2025-09-10 23:28:35
Message-ID: aMIJozPEP5OLI3Yj@paquier.xyz
Lists: pgsql-hackers

On Tue, Sep 09, 2025 at 04:07:45PM +0300, Kouber Saparev wrote:
> Yet again one of our replicas died. Should I file a bug report or
> something, what should we do in order to prevent it? Restart the database
> every month/week or so?...

I don't think we need another bug report for the same problem. We are
aware that this may be an issue and that it is hard to track down; the
problem at this stage is finding room to investigate it.

I may be able to come back to it soon-ish, looking at how to trigger
any race condition. The difficulty is working out how the current code
is able to reach this state, given that we have a race condition at
hand in standbys.

As a start, are these failures only in the startup process? Has the
startup process reached a consistent state when the problem happens,
with the replay code being too eager at removing the stats entries?
Or has it not reached a consistent state yet? These could be useful
hints for extracting a reproducible test case, looking for common
patterns.

I'll ask around to check whether cases like that have been seen in the
user pool I have access to.
--
Michael
