Re: Race condition in recovery?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-05-13 21:07:31
Message-ID: CA+TgmoZ3xNmFnvKqOE5Y7=r+2469dS63xgP9T=h8TAXgX7x+NQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> It seems to me the issue here is not a race condition but
> WaitForWALToBecomeAvailable initializing expectedTLEs with the history
> of a improper timeline. So using recoveryTargetTLI instead of
> receiveTLI for the case fixes this issue.

I agree.

> I believe the 004_timeline_switch.pl detects your issue. And the
> attached change fixes it.

So why does this use recoveryTargetTLI instead of receiveTLI only
conditionally? Why not do it all the time?

The hard thing about this code is that the assumptions are not very
clear. If we don't know why something is a certain way, then we might
break things if we change it. Worse yet, if nobody else knows why it's
like that either, then who knows what assumptions they might be
making? It's hard to be sure that any change is safe.

But that being said, we have a clear definition from the comments for
what expectedTLEs is supposed to contain, and it's only going to end
up with those contents if we initialize it from recoveryTargetTLI. So
I am inclined to think that we ought to do that always, and if it
breaks something, then that's a sign that some other part of the code
also needs fixing, because apparently that hypothetical other part of
the code doesn't work if expctedTLEs contains what the comments say
that it should.

Now maybe that's the wrong idea. But if so, then we're saying that the
definition of expectedTLEs needs to be changed, and we should update
the comments with the new definition, whatever it is. A lot of the
confusion here results from the fact that the code and comments are
inconsistent and we can't tell whether that's intentional or
inadvertent. Let's not leave the next person who looks at this code
wondering the same thing about whatever changes we make.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2021-05-13 21:33:33 Re: pgsql: autovacuum: handle analyze for partitioned tables
Previous Message Tom Lane 2021-05-13 20:49:46 Re: amvalidate(): cache lookup failed for operator class 123