Re: Race condition in recovery?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-05-21 19:44:35
Message-ID: CA+TgmoZcfxEFyxZYkwoiQpq6y602gdoYw4_zeRiiP=jo7fqd2g@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I had trouble following it completely, but I didn't really spot
> anything that seemed definitely wrong. However, I don't understand
> what it has to do with where we are now. What I want to understand is:
> under exactly what circumstances does it matter that
> WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> will stream from receiveTLI rather than recoveryTargetTLI?

Ah ha! I think I figured it out. To hit this bug, you need to meet the
following conditions:

1. Both streaming and archiving have to be configured.
2. You have to promote a new primary.
3. After promoting the new primary you have to start a new standby
that doesn't have local WAL and for which the backup was taken from
the previous timeline. In Dilip's original scenario, this new standby
is actually the old primary, but that's not required.
4. The new standby has to be able to find the history file it needs in
the archive but not the WAL files.
5. The new standby needs to have recovery_target_timeline='latest'
(which is the default).

When you start the new standby, it will fetch the current TLI from its
control file. Then, since recovery_target_timeline=latest, the system
will try to figure out the latest timeline, which only works because
archiving is configured. There seems to be no provision for detecting
the latest timeline via streaming. With archiving enabled, though,
findNewestTimeLine() will be able to restore the history file created
by the promotion of the new primary, which will cause
validateRecoveryParameters() to change recoveryTargetTLI. Then we'll
try to read the WAL segment containing the checkpoint record and fail
because, by stipulation, only history files are available from the
archive. Now, because streaming is also configured, we'll try
streaming. That will work, so we'll be able to read the checkpoint
record, but now, because WaitForWALToBecomeAvailable() initialized
expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
switch to the correct timeline and it all goes wrong.
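
To make that last step concrete, here is a toy model of the timeline
bookkeeping. It is only a sketch: the helper names and the dictionary
are hypothetical and are not PostgreSQL code. It assumes only that the
history for timeline 2 lists timeline 1 as its ancestor, and that
recovery can follow a timeline only if it appears in expectedTLEs.

```python
# Toy model only: these helpers are hypothetical, not PostgreSQL internals.

# Timeline 2 was created by promoting a server that was on timeline 1,
# so the history of timeline 2 includes timeline 1.
TIMELINE_HISTORY = {
    1: [1],
    2: [1, 2],
}


def read_timeline_history(tli):
    """Return the timelines leading up to and including `tli`."""
    return list(TIMELINE_HISTORY[tli])


def can_switch_to_target(expected_tles, recovery_target_tli):
    """In this model, recovery may only follow timelines in expected_tles."""
    return recovery_target_tli in expected_tles


receive_tli = 1          # timeline the new standby streams on (from its backup)
recovery_target_tli = 2  # learned from the history file found in the archive

from_receive = read_timeline_history(receive_tli)
from_target = read_timeline_history(recovery_target_tli)

print("expectedTLEs from receiveTLI:       ", from_receive,
      can_switch_to_target(from_receive, recovery_target_tli))
print("expectedTLEs from recoveryTargetTLI:", from_target,
      can_switch_to_target(from_target, recovery_target_tli))
```

In this model the list built from receiveTLI never contains the target
timeline, which matches the failure described above. It illustrates the
symptom and is not meant as a proposed fix.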

The attached test script, test.sh, seems to reliably reproduce this.
Put that file and the recalcitrant_cp script, also attached, into an
empty directory, cd to that directory, and run test.sh. Afterwards
examine pgcascade.log. Basically, these scripts just set up the
scenario described above. We set up a primary and a standby that use
recalcitrant_cp as the archive command, and because it's recalcitrant,
it's only willing to copy history files, and always fails for WAL
files. Then we create a cascading standby by taking a base backup from
the standby, but before actually starting it, we promote the original
standby. So now it meets all the conditions described above. I tried a
couple variants of this test. If I switch the archive command from
recalcitrant_cp to just regular cp, then there's no problem. And if I
switch it to something that always fails, then there's also no
problem. That's because, with either of those changes, condition (4)
above is no longer met. In the first case, both files end up in the
archive, and in the second case, neither file.
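
For readers without the attachments, the three archive-command variants
can be sketched as below. This is a hypothetical stand-in written for
illustration, not the attached recalcitrant_cp script. The function
names, the temporary directory layout, and the two placeholder file
names are assumptions. The sketch checks which variant leaves the
archive in the state that condition (4) requires.

```python
import os
import shutil
import tempfile


def recalcitrant_copy(src, dst):
    """Copy only timeline history files; refuse everything else.

    Returns 0 on success and 1 on refusal, like a command's exit status.
    """
    if src.endswith(".history"):
        shutil.copy(src, dst)
        return 0
    return 1


def regular_copy(src, dst):
    """Stand-in for plain cp: archive every file."""
    shutil.copy(src, dst)
    return 0


def always_fail(src, dst):
    """Stand-in for an archive command that never succeeds."""
    return 1


def archive_contents(archive_fn):
    """Offer one history file and one WAL segment to archive_fn."""
    with tempfile.TemporaryDirectory() as tmp:
        wal_dir = os.path.join(tmp, "pg_wal")
        archive_dir = os.path.join(tmp, "archive")
        os.mkdir(wal_dir)
        os.mkdir(archive_dir)
        # Placeholder names that follow the usual naming pattern.
        for name in ("00000002.history", "000000010000000000000003"):
            path = os.path.join(wal_dir, name)
            with open(path, "w") as f:
                f.write("placeholder\n")
            archive_fn(path, os.path.join(archive_dir, name))
        return sorted(os.listdir(archive_dir))


def condition_4_met(archived):
    """True when the archive has a history file but no WAL files."""
    has_history = any(n.endswith(".history") for n in archived)
    has_wal = any(not n.endswith(".history") for n in archived)
    return has_history and not has_wal


for fn in (recalcitrant_copy, regular_copy, always_fail):
    files = archive_contents(fn)
    print(fn.__name__, files, "condition 4 met:", condition_4_met(files))
```

Only the recalcitrant variant leaves a history file without any WAL
files in the archive, which is why the other two variants do not hit
the problem.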

What about hitting this in real life, with a real archive command?
Well, you'd probably need the archive command to be kind of slow and
get unlucky on the timing, but there's nothing to prevent it from
happening.

But, it will be WAY more likely if you have Dilip's original scenario,
where you try to repurpose an old primary as a standby. It would
normally be unlikely that the backup used to create a new standby
would have an older TLI, because you typically wouldn't switch masters
in between taking a base backup and using it to create a new standby.
But the old master always has an older TLI. So (3) is satisfied. For
(4) to be satisfied, you need the old master to fail to archive all of
its WAL when it shuts down.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:
recalcitrant_cp  (application/octet-stream, 274 bytes)
test.sh          (text/x-sh, 1.4 KB)
