Re: Standby trying "restore_command" before local WAL

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Emre Hasegeli <emre(at)hasegeli(dot)com>, Sergei Kornilov <sk(at)zsrv(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "berge(at)trivini(dot)no" <berge(at)trivini(dot)no>, Gürkan Gür <ben(at)gurkan(dot)in>, Raimund Schlichtiger <raimund(dot)schlichtiger(at)innogames(dot)com>, Bernhard Schrader <bernhard(dot)schrader(at)innogames(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>
Subject: Re: Standby trying "restore_command" before local WAL
Date: 2018-08-07 12:25:18
Message-ID: 1c73846e-272f-b98a-329d-0a385f747182@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/06/2018 09:32 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
>> On 08/06/2018 06:11 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
>>>> On 08/06/2018 05:19 PM, Stephen Frost wrote:
>>>>> * David Steele (david(at)pgmasters(dot)net) wrote:
>>>>>> I think for the stated scenario (known good standby that has been
>>>>>> shutdown gracefully) it makes perfect sense to trust the contents of
>>>>>> pg_wal. Call this scenario #1.
>>>>>>
>>>>>> An alternate scenario (#2) is that the data directory was copied using a
>>>>>> basic copy tool and the pg_wal directory was not excluded from the copy.
>>>>>> This means the contents of pg_wal will be in an inconsistent state.
>>>>>> The files that are there might be partials (not with the extension,
>>>>>> though) and you can easily have multiple partials. You will almost
>>>>>> certainly not have everything you need to get to consistency.
>>>>
>>>> Yeah. But as Simon said, we do have fairly strong protections about applying
>>>> corrupted WAL - every record is CRC-checked. So why not to fall-back to the
>>>> restore_command only if the locally available WAL is not fully consistent?
>>>
>>> "Corrupted" doesn't necessairly only mean "the data file was munged by
>>> the storage somehow." In this case, corrupted could be an old and only
>>> partial WAL file, in which case we'd possibly be missing WAL that needs
>>> to be replayed to bring the cluster back to a valid state, no?
>>
>> Why wouldn't that be detected by checksums? Also, why wouldn't this be
>> handled correctly by the logic I proposed, i.e. falling-back to remote WAL
>> segment if the file is incomplete/broken in some sense?
>
> WAL checksums are per WAL record, not across the whole file... And,
> yes, this seems like something we could probably write code to ensure
> we're doing correctly, possibly even without changing the existing WAL
> format or maybe we would have to, but either way, seems like additional
> code that would need to be written and some serious thought put into it
> to make sure it's correct. I'm all for that, just to be clear, but what
> I don't think we can do is move forward with a change to just prefer
> pg_wal over restore command.
>

While the checksums are per-record, the record also contains xl_prev, so
it's effectively a chain of checksums covering the whole file. The other
thing you can do is look at the first record of the next segment and use
the xl_prev to detect if the previous segment was not partial.

But I do agree doing this properly may require writing some new code and
checking that those cases are detected correctly.

(Note: There was a discussion about replacing xl_prev with LSN of the
current record, and IMHO that would work just as well, although it might
be a bit more expensive for some operations.)

>>>>>> But there's another good scenario (#3): where the pg_wal directory was
>>>>>> preloaded with all the WAL required to make the cluster consistent or
>>>>>> all the WAL that was available at restore time. In this case, it would
>>>>>> be make sense to prefer the contents of pg_wal and only switch to
>>>>>> restore_command after that has been exhausted.
>>>>>>
>>>>>> So, the choice of whether to prefer locally-stored or
>>>>>> restore_command-fetched WAL is context-dependent, in my mind.
>>>>>
>>>>> Agreed.
>>>>
>>>> Maybe, not sure.
>>>
>>> The argument that David makes above in scenario #2 certainly looks
>>> entirely likely to me and I don't think we've got any real protections
>>> against that. The current common use-cases happen to work around the
>>> risk because tools like pg_basebackup ignore the existing pg_wal
>>> directory when doing the backup and instead populate it with exactly the
>>> correct WAL that's needed, and in cases where a restore command is
>>> specified will always pull back only valid WAL, but I don't think we can
>>> decide that this scenario (#2 from above):
>>
>> I'm probably missing something, but why couldn't we detect that using CRC.
>> Or make sure we can detect that, e.g. by adding some additional info into
>> each WAL segment?
>
> CRC's are per WAL record, and possibly some WAL records might not be ok
> to replay, or at least we need to make sure that we replay the right set
> of WAL in the right order even when there are partial WAL files being
> given to PG (that aren't named that way...). The more I think about
> this, I think we really need to avoid partial WAL files entirely- what
> are we going to do when we get to the end of one? We'd need to request
> the full one from the restore command anyway, so seems like we should
> just go ahead and get it from the archive, the question is if there's an
> easy/cheap way to detect partial WAL files in pg_wal.
>

As explained above, I don't think this is actually a problem. The
checksums do cover the whole file thanks to chaining, and there are ways
to detect partial segments. IMHO it's fine if we replay a segment and
then find out it was partial and that we need to fetch it from archive
anyway and re-apply it - it should not be very common case, except when
the user does something silly.

>>>>>> Ideally we could have a default that is safe in each scenario with
>>>>>> perhaps an override if the user knows better. Scenario #1 would allow
>>>>>> WAL to be read from pg_wal by default, scenario #2 would prefer fetched
>>>>>> WAL, and scenario #3 could use a GUC to override the default fetch behavior.
>>>>>
>>>>> Not sure how we'd be able to automatically realize which scenario we're
>>>>> in though..?
>>>>
>>>> But do we need to know it? I mean, can't we try the local WAL first, use it
>>>> if it passes the CRC checks (and possibly some other checks), and only
>>>> fallback to the remote WAL if it's identified as broken?
>>>
>>> Maybe- but I think we need to be quite sure about that and I don't
>>> believe that just checking the CRCs is enough.
>>
>> Sure. So what else would we need to add to each WAL segment to make that
>> possible? It needs to be cheap enough not to cause perf regression, but e.g.
>> a CRC of each segment + amount of data actually stored in it should be
>> sufficient.
>
> As I mention above, seems like what we should really be doing is having
> a way to know when a WAL file is full but still in pg_wal for some
> reason (hasn't been replayed yet due to intential lag on the replica, or
> unintentional lag on the replica, etc), and perhaps we even only do that
> on replicas or have tools like pg_basebackup and pgbackrest do that when
> they're doing a restore, so that it's clear to PG that all these files
> are full WAL and they can replay them completely.
>

IMHO we can deduce if from first xl_prev of the next segment (assuming
we have the next segment available locally, which we likely have except
for the last one).

>> Another idea - what if we instead allow fetching additional information of
>> the archived WAL, and make decision based on that? For example, before
>> restore_command we could request md5 checksum from the archive, compare it
>> to the local WAL, and if they match then use the local one. Otherwise
>> request the remote one. That'd still be a win.
>
> I was actually just thinking about having pgbackrest do exactly that. :)
> We already have the full sha256 checksum of every WAL file in the
> pgbackrest repository, it seems like it wouldn't be too hard to
> calculate the checksum of the files in pg_wal and ask the repo what the
> checksums are for those WAL files and then use the files in pg_wal if
> they checksums match. I can already imagine the argument coming back
> though- that'd require additional testing to ensure it's done correctly
> and I'm really not sure that it'd end up being all that much better
> except in some very special circumstances where there's a low bandwidth
> connection to the repo and often a lot of WAL left over in pg_wal.
>

I don't think it can be done in an external tool entirely, at least not
using restore_command alone. We remove the local segment before even
invoking restore_command, so it's too late to check checksums etc.

We'd need invoking some other command before restore_command, which
would do this comparison and decide whether to use local or remote WAL.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Lætitia Avrot 2018-08-07 12:29:43 Re: Constraint documentation
Previous Message Jesper Pedersen 2018-08-07 12:14:08 Re: partition tree inspection functions