Re: Race condition in recovery?

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-05-22 06:40:22
Message-ID: CAFiTN-sjsyxVOBuu58y8gMEtt4TYtbvR6xLCgJxqUzNZjyWfvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, May 22, 2021 at 10:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > The attached test script, test.sh seems to reliably reproduce this.
> > Put that file and the recalcitrant_cp script, also attached, into an
>
> I haven't tested this, but I will do that. But now we are on the same
> page about the cause of the actual problem I reported.

Now, I have tested. I am able to reproduce the actual problem with your script.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Sat, May 22, 2021 at 10:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > > I had trouble following it completely, but I didn't really spot
> > > anything that seemed definitely wrong. However, I don't understand
> > > what it has to do with where we are now. What I want to understand is:
> > > under exactly what circumstances does it matter that
> > > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> > > will stream from receiveTLI rather than recoveryTargetTLI?
> >
> > Ah ha! I think I figured it out. To hit this bug, you need to meet the
> > following conditions:
> >
> > 1. Both streaming and archiving have to be configured.
> > 2. You have to promote a new primary.
> > 3. After promoting the new primary you have to start a new standby
> > that doesn't have local WAL and for which the backup was taken from
> > the previous timeline. In Dilip's original scenario, this new standby
> > is actually the old primary, but that's not required.
>
> No, in my original scenario also the new standby was not old primary,
> I had 3 nodes
> node1-> primary, node2 -> standby1, node3-> standby2
> node2 promoted as a new primary and node3's local WAL was removed (so
> that it has to stream checkpoint record from new primary and then
> remaining everything happens as you explain in remaining steps).
>
> > 4. The new standby has to be able to find the history file it needs in
> > the archive but not the WAL files.
> > 5. The new standby needs to have recovery_target_timeline='latest'
> > (which is the default)
> >
> > When you start the new standby, it will fetch the current TLI from its
> > control file. Then, since recovery_target_timeline=latest, the system
> > will try to figure out the latest timeline, which only works because
> > archiving is configured. There seems to be no provision for detecting
> > the latest timeline via streaming. With archiving enabled, though,
> > findNewestTimeLine() will be able to restore the history file created
> > by the promotion of the new primary, which will cause
> > validateRecoveryParameters() to change recoveryTargetTLI.
>
> Right
>
> Then we'll
> > try to read the WAL segment containing the checkpoint record and fail
> > because, by stipulation, only history files are available from the
> > archive. Now, because streaming is also configured, we'll try
> > streaming. That will work, so we'll be able to read the checkpoint
> > record, but now, because WaitForWALToBecomeAvailable() initialized
> > expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
> > switch to the correct timeline and it all goes wrong.
>
> exactly
>
> > The attached test script, test.sh seems to reliably reproduce this.
> > Put that file and the recalcitrant_cp script, also attached, into an
> > empty directory, cd to that directory, and run test.sh. Afterwards
> > examine pgcascade.log. Basically, these scripts just set up the
> > scenario described above. We set up primary and a standby that use
> > recalcitrant_cp as the archive command, and because it's recalcitrant,
> > it's only willing to copy history files, and always fails for WAL
> > files.Then we create a cascading standby by taking a base backup from
> > the standby, but before actually starting it, we promote the original
> > standby. So now it meets all the conditions described above. I tried a
> > couple variants of this test. If I switch the archive command from
> > recalcitrant_cp to just regular cp, then there's no problem. And if I
> > switch it to something that always fails, then there's also no
> > problem. That's because, with either of those changes, condition (4)
> > above is no longer met. In the first case, both files end up in the
> > archive, and in the second case, neither file.
>
> I haven't tested this, but I will do that. But now we are on the same
> page about the cause of the actual problem I reported.
>
> > What about hitting this in real life, with a real archive command?
> > Well, you'd probably need the archive command to be kind of slow and
> > get unlucky on the timing, but there's nothing to prevent it from
> > happening.
>
> Right
>
> > But, it will be WAY more likely if you have Dilip's original scenario,
> > where you try to repurpose an old primary as a standby. It would
> > normally be unlikely that the backup used to create a new standby
> > would have an older TLI, because you typically wouldn't switch masters
> > in between taking a base backup and using it to create a new standby.
> > But the old master always has an older TLI. So (3) is satisfied. For
> > (4) to be satisfied, you need the old master to fail to archive all of
> > its WAL when it shuts down.
>
> For my original case, both standby1 and standby2 are connected to the
> primary. Now, standby1 is promoted and standby2 is shut down. And,
> before restarting, all the local WAL of the standby2 is removed so
> that it can follow the new primary. The primary info and restore
> command for standby2 are changed as per the new primary(standby1).
>
> Now the scenario is that the standby1 has switched the timeline in the
> middle of the segment which contains the checkpoint record, so the
> segment with old TL is renamed to (.partial) and the same segment with
> new TL is not yet archived but the history file for the new TL has
> been archived.
>
> Now, when standby2 restart the remaining things happened as you
> explained, basically it restores the history file and changes the
> recoveryTargetTLI but it doesn't get the WAL file from the archive.
> So try to stream checkpoint record from the primary using
> "ControlFile->checkPointCopy.ThisTimeLineID", which is old timeline.
>
> Now, we may ask that if the WAL segment with old TL on standby1(new
> primary) which contains the checkpoint is already renamed to
> ".partial" then how can it stream using the old TL then the answer is
> below code[1] in the walsender. Basically, the checkpoint record is
> present in both new and old TL as TL switched in the middle of the
> segment, it will send you the record from the new TL even if the
> walreciever asks to stream with old TL. Now walrecievr is under
> impression that it has read from the old TL. And, we know the rest of
> the story that we will set the expectedTLEs based on the old history
> file and never be able to go to the new TL.
>
> Anyways now we understand the issue and there are many ways we can
> reproduce it. Still, I thought of explaining the exact steps how it
> happed for me because now we understand it well so I think it is easy
> to explain :)
>
> [1]
> WalSndSegmentOpen()
> {
> /*-------
> * When reading from a historic timeline, and there is a timeline switch
> * within this segment, read from the WAL segment belonging to the new
> * timeline.
> }
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2021-05-22 08:08:46 Re: seawasp failing, maybe in glibc allocator
Previous Message Bharath Rupireddy 2021-05-22 05:39:10 Re: Fixing the docs for ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION