Re: pg_rewind race condition just after promotion

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: hlinnaka(at)iki(dot)fi
Cc: pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net, alexk(at)hintbits(dot)com, michael(dot)paquier(at)gmail(dot)com
Subject: Re: pg_rewind race condition just after promotion
Date: 2020-12-08 04:45:33
Message-ID: 20201208.134533.202575181178574697.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Mon, 7 Dec 2020 20:13:25 +0200, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in
> There's a race condition between the checkpoint at promotion and
> pg_rewind. When a server is promoted, the startup process writes an
> end-of-recovery checkpoint that includes the new TLI, and the server
> is immediately opened for business. The startup process requests the
> checkpointer process to perform a checkpoint, but it can take a few
> seconds or more to complete. If you run pg_rewind, using the just
> promoted server as the source, pg_rewind will think that the server is
> still on the old timeline, because it only looks at TLI in the control
> file's copy of the checkpoint record. That's not updated until the
> checkpoint is finished.
>
> This isn't a new issue. Stephen Frost first reported it back in 2015
> [1]. Back then, it was deemed just a small annoyance, and we just
> worked around it in the tests by issuing a checkpoint command after
> promotion, to wait for the checkpoint to finish. I just ran into it
> again today, with the new pg_rewind test, and silenced it in a
> similar way.

I (or we) have run into this as well, and worked around it by checking
for the history file on the primary.

> I think we should fix this properly. I'm not sure if it can lead to a
> broken cluster, but at least it can cause pg_rewind to fail
> unnecessarily and in a user-unfriendly way. But this is actually
> pretty simple to fix. pg_rewind looks at the control file to find out
> the timeline the server is on. When promotion happens, the startup
> process updates minRecoveryPoint and minRecoveryPointTLI fields in the
> control file. We just need to read it from there. Patch attached.

Looks fine to me. I'm a bit concerned about making sourceHistory
needlessly file-local, but on the other hand, unifying sourceHistory
and targetHistory looks better.

For the test part, that change doesn't necessarily catch a failure of
the current code, but I *believe* the previous code was the result of
an actual failure in the past, so the test would probabilistically (or
depending on the platform?) hit the failure if it happened.

> I think we should also backpatch this. Back in 2015, we decided that
> we can live with this, but it's always been a bit bogus, and seems
> simple enough to fix.

I don't think this changes any currently successful behavior; it only
rescues the failure case, so +1 for back-patching.

> Thoughts?
>
> [1]
> https://www.postgresql.org/message-id/20150428180253.GU30322%40tamriel.snowman.net

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
