Bogus WAL segments archived after promotion

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Bogus WAL segments archived after promotion
Date: 2014-12-19 12:55:16
Message-ID: 54942034.7080303@vmware.com
Lists: pgsql-hackers

When streaming replication was introduced in 9.0, we started to recycle
old WAL segments in archive recovery, like we do during normal
operation. The WAL segments are recycled on the current timeline. There
is no guarantee that they will be useful if the current timeline changes,
either because we switch to recovering another timeline or because the
standby is promoted, but that was thought to be harmless.

However, consider what happens after a server is promoted, and WAL
archiving is enabled. The server's pg_xlog directory will look something
like this:

> -rw------- 1 heikki heikki 16777216 Dec 19 14:22 000000010000000000000005
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000006
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000007
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000008
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000009
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 00000001000000000000000A
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 00000001000000000000000B
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 00000001000000000000000C
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 00000001000000000000000D
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 00000001000000000000000E
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 00000001000000000000000F
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000010
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000011
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000012
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000013
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000014
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000015
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000016
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000017
> -rw------- 1 heikki heikki 16777216 Dec 19 14:23 000000010000000000000018
> -rw------- 1 heikki heikki 16777216 Dec 19 14:24 000000010000000000000019
> -rw------- 1 heikki heikki 16777216 Dec 19 14:22 00000001000000000000001A
> -rw------- 1 heikki heikki 16777216 Dec 19 14:22 00000001000000000000001B
> -rw------- 1 heikki heikki 16777216 Dec 19 14:22 00000001000000000000001C
> -rw------- 1 heikki heikki 16777216 Dec 19 14:24 000000020000000000000019
> -rw------- 1 heikki heikki 16777216 Dec 19 14:24 00000002000000000000001A
> -rw------- 1 heikki heikki 42 Dec 19 14:24 00000002.history

The files on timeline 1, up to 000000010000000000000019, are valid
segments, streamed from the primary or restored from the WAL archive.
The segments 00000001000000000000001A and 00000001000000000000001B are
recycled segments that haven't been reused yet. Their contents are not
valid (they contain records from some earlier point in WAL, but it might
as well be garbage).

The server was promoted within the segment 19, and a new timeline was
started. Segments 000000020000000000000019 and 00000002000000000000001A
contain valid WAL on the new timeline.
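
To make the listing easier to read, here is a rough sketch in Python (not PostgreSQL code) that decodes the segment file names and sorts them into the groups described above. It assumes the default 16 MB segment size; the function names and category labels are made up for illustration.

```python
import re

# Assumption: default 16 MB segments, so each 4 GB "log id" holds 256 segments.
SEGMENTS_PER_LOG_ID = 0x100000000 // (16 * 1024 * 1024)

# A WAL segment file name is 24 upper-case hex digits:
# 8 for the timeline, 8 for the log id, 8 for the segment within the log id.
WAL_NAME_RE = re.compile(r"^[0-9A-F]{24}$")


def parse_wal_name(name):
    """Return (timeline, segment number) for a WAL segment file name."""
    if not WAL_NAME_RE.match(name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log_id * SEGMENTS_PER_LOG_ID + seg


def classify(name, old_tli, new_tli, switch_segno):
    """Hypothetical helper: label a segment relative to the timeline switch."""
    timeline, segno = parse_wal_name(name)
    if timeline == old_tli and segno <= switch_segno:
        return "old timeline, valid"
    if timeline == old_tli:
        # Either a recycled segment or streamed WAL that was never applied.
        return "past switch point"
    if timeline == new_tli and segno >= switch_segno:
        return "new timeline"
    return "other"


if __name__ == "__main__":
    for name in ("000000010000000000000019",
                 "00000001000000000000001A",
                 "00000002000000000000001A"):
        print(name, "->", classify(name, old_tli=1, new_tli=2, switch_segno=0x19))
```

Note that the file name alone cannot tell a recycled segment apart from one holding real but unapplied WAL; both simply sit past the switch point on the old timeline.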

Now, after enough time passes that the bogus 00000001000000000000001A
and 00000001000000000000001B segments become old enough to be recycled,
the system will see that there is no .ready or .done file for them, and
will create .ready files so that they are archived. And they are
archived. That's bogus, because the files are bogus. Worse, if the
primary server that this server was forked off from continues running,
and creates the genuine 00000001000000000000001A and
00000001000000000000001B segments, it can fail to archive them if the
standby had already archived the bogus segments with the same names.
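
The collision can be shown with a toy model in Python. This is only a sketch: it assumes a shared archive whose archive_command refuses to overwrite an existing file (as the PostgreSQL documentation recommends), and the ToyArchive class and its contents are invented for illustration.

```python
class ToyArchive:
    """Hypothetical stand-in for a shared WAL archive that never overwrites."""

    def __init__(self):
        self.files = {}

    def archive(self, name, contents):
        """Store a segment; return False if the name is already taken."""
        if name in self.files:
            return False
        self.files[name] = contents
        return True


if __name__ == "__main__":
    shared = ToyArchive()
    # The promoted standby archives its recycled, never-reused segment first.
    print(shared.archive("00000001000000000000001A", "stale records"))
    # The old primary later fills the genuine segment, but cannot archive it.
    print(shared.archive("00000001000000000000001A", "genuine WAL"))
    print(shared.files["00000001000000000000001A"])
```

In this model the second call fails and the archive keeps the bogus copy, which is the failure described above.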

We must somehow prevent the recycled, but not yet used, segments from
being archived. One idea is to not create them in the first place, i.e.
don't recycle old segments during recovery, just delete them and have
new ones be created on demand. That's simple, but would hurt performance.

I'm thinking that we should add a step to promotion, where we scan
pg_xlog for any segments higher than the timeline switch point, and
remove them, or mark them with .done so that they are not archived.
There might be some real WAL that was streamed from the primary, but not
yet applied, but such WAL is of no interest to that server anyway, after
it's been promoted. It's a bit disconcerting to zap WAL that's valid,
even if it doesn't belong to the current server's timeline history, because
as a general rule it's good to avoid destroying evidence that might be
useful in debugging. There isn't much difference between removing them
immediately and marking them as .done, though, because they will
eventually be removed/recycled anyway if they're marked as .done.
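
The .done variant of that promotion step could look roughly like the following Python sketch. It is not the proposed patch, just an illustration of the scan: it assumes 16 MB segments and status files kept in the archive_status subdirectory of pg_xlog, and the function names are hypothetical.

```python
import os
import re

SEGMENTS_PER_LOG_ID = 256  # assumption: default 16 MB segments
WAL_NAME_RE = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name):
    """Segment number encoded in the last 16 hex digits of the file name."""
    return int(name[8:16], 16) * SEGMENTS_PER_LOG_ID + int(name[16:24], 16)


def mark_segments_past_switch_done(xlog_dir, old_tli, switch_segno):
    """Mark old-timeline segments beyond the switch point as .done.

    Returns the names of the segments that are covered by a .done file
    afterwards, so that the archiver will skip them.
    """
    status_dir = os.path.join(xlog_dir, "archive_status")
    os.makedirs(status_dir, exist_ok=True)
    marked = []
    for name in sorted(os.listdir(xlog_dir)):
        if not WAL_NAME_RE.match(name):
            continue  # skips .history files and the archive_status directory
        if int(name[:8], 16) != old_tli:
            continue  # segments on the new timeline are left alone
        if segment_number(name) <= switch_segno:
            continue  # at or before the switch point: not touched here
        ready = os.path.join(status_dir, name + ".ready")
        done = os.path.join(status_dir, name + ".done")
        if os.path.exists(ready):
            os.rename(ready, done)
        elif not os.path.exists(done):
            open(done, "w").close()
        marked.append(name)
    return marked
```

For the directory listed earlier, calling it with old_tli=1 and switch_segno=0x19 would mark the timeline-1 segments from 1A onwards and leave everything else untouched.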

The archival behaviour at promotion is a bit inconsistent and weird
anyway; even valid, streamed WAL is marked as .done and never archived,
except for the last partial segment. We're discussing that in
the other thread (Streaming replication and WAL archive interactions,
http://www.postgresql.org/message-id/689EB259-44C2-4820-B901-4F6B1C55A1E4@simply.name),
but it would be good to have a small, back-patchable fix to prevent bogus
segments from being archived.

- Heikki
