Archiving done right

From: David Steele <david(at)pgmasters(dot)net>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Archiving done right
Date: 2015-08-21 14:25:00
Message-ID: 55D734BC.9000602@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I split this into a new thread to talk about archiving best practices.

On 8/18/15 2:25 PM, Stephen Frost wrote:
> * Stephen Frost (sfrost(at)snowman(dot)net) wrote:
>> * Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
>>> On Mon, Aug 17, 2015 at 2:50 PM, Peter Eisentraut <peter_e(at)gmx(dot)net>
wrote:
>>>> The commit message for de76884 contains some important information
about
>>>> the purpose and use of the new .partial WAL files. But I don't see
>>>> anything about this in the documentation or another user-visible place.
>>>> We should probably add something.
>>>
>>> +1. Heikki talked about this at his PGCon presentation, and I thought
>>> that was pretty helpful stuff, but not everyone was there (or will
>>> remember what he said when the time comes that they need to know).
>>
>> Hopefully the audio or video of it will be posted eventually.. I was
>> hoping to review it myself and to discuss the points he made in depth
>> with David, to make sure we cover all of them (pretty sure we do, but
>> good to verify).

I finally had a chance to watch Heikki's lecture on warm standbys from
PGCon. It's clear to me that the "correct" archiving of WAL segments is
in many ways the core issue.

I've been working on this problem for quite some time in the context of
my work with pgBackRest so I thought it might be good to sum up the
salient points from the talk, where I've gotten with implementation, and
what remains to be done. Hopefully this will provoke some comments.

1) The archiver should fsync the file before returning success

I took it one step further at Andres' recommendation:

a. copy the WAL segment to a temp file
b. fsync the file
c. rename the WAL segment to the final file name
d. fsync the directory

2) The archiver should accept duplicate WAL segments

Since archive_status is not copied during backups there are scenarios
outside of standby where Postgres may try to archive a WAL segment that
has already been archived.

Heikki mentioned some time ago that a good archiver should allow
duplicates without complaint (I'm not sure of the thread, but something
to do with the 9.5 work). Since pgBackRest calculates SHA1 hashes for
everything it handles this was pretty straight-forward to implement.
The WAL file name and hash are compared against the archive - if it's an
exact match then 0 is returned to Postgres.

3) The archiver should accept partial segments

Here's where it gets interesting. 9.5 will label the partial WAL
segment with a special extension but previous versions do not. This
left a choice - rename or delete the conflicting WAL segment that was
already in the archive - IF the master had already archived it.

Controlled failovers are almost guaranteed to have the conflicting
segment archived because of the graceful shutdown, emergency failovers
somewhat less so. It really depends on the nature of the failure.

One technique < 9.5 would be to accept duplicate WAL and just assign a
different name (.dup1, .dup2, ...) but I decided not to do that because
it might mask real issues and the delete/rename WAL segment is a
one-time deal after promotion. So far I've stuck with that.

Heikki's solution for 9.5 is clearly better but pgBackRest currently
supports Postgres >= 8.3 so it will be many years before this can be the
only solution.

Anyone have ideas on what could be done here in an automated way for
older versions?

4) On a possible archive gap after promotion

If the master was behind on archiving before it died (a very possible
scenario) then there could be a gap in archive between where the master
left off and the standby began.

In the current scheme, if the standby has not been archiving and then
gets promoted, won't it try to push the older WAL segments because the
.done files do not exist in archive_status? Or does the archiver not
cross timelines?

I don't have specific tests for this yet but I thought that was the
behavior I had seen in the past.

Of course, this is not fool-proof because some of the WAL segments in
the gap may have already been recycled on the standby.

As Heikki mentioned, this could be solved with pg_receivexlog running on
the backup server (assuming you have one). I've resisted integrating
this tool with pgBackRest but now I realize that it could work in
concert with archive_command to provide a more resilient solution.

5) Which brings us to archive_mode = always

I already have tests to make sure that the archiver and asynchronous
archiver do not conflict and I think this can easily be extended to
multiple synchronous/asynchronous archivers against a single archive. I
believe it may already be "safe" but one process might receive an error
depending on timing.

pgBackRest has its own protocol and remote process so taking locks to
prevent race conditions works either locally or remotely. If the Linux
kernel is sufficiently recent (>= 2.6.12) it should work on NFS as well
but I have not tested it yet.

6) And lastly, archiving to the correct archive

During the talk somebody mentioned the possibility of archiving to the
wrong location and mixing up two databases' WAL segments. pgBackRest
prevents this by reading the xlog version and database system id out of
the WAL header and comparing that to what is already in the archive.

--
-David
david(at)pgmasters(dot)net

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-08-21 14:51:07 Re: Make HeapTupleSatisfiesMVCC more concurrent
Previous Message Robert Haas 2015-08-21 14:10:42 Re: [DESIGN] ParallelAppend