Re: Standby trying "restore_command" before local WAL

From: David Steele <david(at)pgmasters(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Sergei Kornilov <sk(at)zsrv(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "berge(at)trivini(dot)no" <berge(at)trivini(dot)no>, Gürkan Gür <ben(at)gurkan(dot)in>, Raimund Schlichtiger <raimund(dot)schlichtiger(at)innogames(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Bernhard Schrader <bernhard(dot)schrader(at)innogames(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>
Subject: Re: Standby trying "restore_command" before local WAL
Date: 2018-08-03 14:16:54
Message-ID: a12bf402-9588-f11a-c91b-ab0f5c90a1ca@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 8/2/18 4:08 PM, Robert Haas wrote:
> On Wed, Aug 1, 2018 at 7:14 AM, Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:
>>> There's still a question here, at least from my perspective, as to which
>>> is actually going to be faster to perform recovery based off of. A good
>>> restore command, which pre-fetches the WAL in parallel and gets it local
>>> and on the same filesystem, meaning that the restore_command only has to
>>> execute essentially a 'mv' and return back to PG for the next WAL file,
>>> is really rather fast, compared to streaming that same data over the
>>> network with a single TCP connection to the primary. Of course, there's
>>> a lot of variables there and it depends on the network speed between the
>>> various pieces, but I've certainly had cases where a replica catches up
>>> much faster using restore command than streaming from the primary.
>>
>> Trying "restore_command" before streaming replication is totally fine.
>> It is not likely that the same WAL would be on both places anyway.
>>
>> My problem is trying "restore_command" before the local WAL. I
>> understand the historic reason of this design, but I don't think it is
>> expected behavior to anybody who is using "restore_command" together
>> with streaming replication.
>
> Right. I don't really understand the argument that this should be
> controlled by a GUC. I could see having a GUC to choose between
> archiving-first and streaming-first, but if it's safe to use the WAL
> we've already got in pg_xlog, it seems like that should take priority
> over every other approach. The comments lend themselves to a certain
> amount of doubt over whether we can actually trust the contents of
> pg_xlog, but if we can't, it seems like we just shouldn't use it - at
> all - ever. It doesn't make much sense to me to say "hey, pg_xlog
> might have evil bad WAL in it that we shouldn't replay, so let's look
> for the same WAL elsewhere first, but then if we don't find it, we'll
> just replay the bad stuff."

I think for the stated scenario (known good standby that has been
shutdown gracefully) it makes perfect sense to trust the contents of
pg_wal. Call this scenario #1.

An alternate scenario (#2) is that the data directory was copied using a
basic copy tool and the pg_wal directory was not excluded from the copy.
This means the contents of pg_wal will be in an inconsistent state.
The files that are there might be partials (not with the extension,
though) and you can easily have multiple partials. You will almost
certainly not have everything you need to get to consistency.

But there's another good scenario (#3): where the pg_wal directory was
preloaded with all the WAL required to make the cluster consistent or
all the WAL that was available at restore time. In this case, it would
be make sense to prefer the contents of pg_wal and only switch to
restore_command after that has been exhausted.

So, the choice of whether to prefer locally-stored or
restore_command-fetched WAL is context-dependent, in my mind.

Ideally we could have a default that is safe in each scenario with
perhaps an override if the user knows better. Scenario #1 would allow
WAL to be read from pg_wal by default, scenario #2 would prefer fetched
WAL, and scenario #3 could use a GUC to override the default fetch behavior.

Regards,
--
-David
david(at)pgmasters(dot)net

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2018-08-03 14:24:50 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message Robert Haas 2018-08-03 14:14:04 Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.