pg_receivexlog creating holes in WAL (HEAD only)

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>
Subject: pg_receivexlog creating holes in WAL (HEAD only)
Date: 2016-12-27 03:10:35
Message-ID: CAB7nPqQXhAgo9LGgGwf0h-KGratuc1OvcbZ21Ko1RqxXGxkppA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi all,

On HEAD, I am seeing the following thing with pg_receivexlog.
1) Run it, server is still on segment 1/0/1:
$ pg_receivexlog -D ~/data/logs/ --verbose
pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)
$ psql -X -c 'select pg_current_xlog_location()'
pg_current_xlog_location
--------------------------
0/1510840
(1 row)
2) Only a partial segment is being written:
$ ls ~/data/logs/
000000010000000000000001.partial
3) Kill the PG instance, pg_receivexlog rightfully complains:
pg_receivexlog: could not receive data from WAL stream: server closed
the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_receivexlog: disconnected; waiting 5 seconds to try again
2016-12-27 11:35:02 JST [42342]: [1-1] db=,user=,app=,client= DEBUG:
logger shutting down
pg_receivexlog: could not connect to server: could not connect to
server: Connection refused
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
4) Looking at the logs directory, the .partial segment went away:
$ ls logs/
000000010000000000000001
5) Restart the PG instance, pg_receivexlog complains, again:
pg_receivexlog: starting log streaming at 0/2000000 (timeline 1)
pg_receivexlog: unexpected termination of replication stream: ERROR:
requested starting point 0/2000000 is ahead of the WAL flush position
of this server 0/15108B0
pg_receivexlog: disconnected; waiting 5 seconds to try again

Oops. Once the server switches to segment 1/0/2, it will then work
again, but any content from the moment where the server has been
killed to the moment it has reconnected has been lost. By that I mean
the end of segment 1/0/1.

The issue here is at step 4). When a connection failure happens,
pg_receivexlog renames the .partial segment to be a completed segment.
So at its next loop it considers the segment as sane, but it should be
considered as corrupted, and it should not use it. On back-branches,
9.6 included, the .partial file is kept after a connection failure and
streaming begins from the beginning of the partial segment, the last
.partial file being overwritten. In short, this is an oversight from
56c7d8d, this bit to be precise:
@@ -777,10 +679,10 @@ ReceiveXlogStream(PGconn *conn, StreamCtl *stream)
}

error:
- if (walfile != -1 && close(walfile) != 0)
+ if (walfile != NULL && stream->walmethod->close(walfile, CLOSE_NORMAL) != 0)
fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
- progname, current_walfile_name, strerror(errno));
- walfile = -1;
+ progname, current_walfile_name, stream->walmethod->getlasterror());
+ walfile = NULL;

The problem is that when a temporary suffix is defined, CLOSE_NORMAL
will rename the existing segment (this is not close_walfile()!).
Attached is a patch to fix the problem, which is to just use
CLOSE_NO_RENAME instead of CLOSE_NORMAL.

Thanks,
--
Michael

Attachment Content-Type Size
receivexlog-kill-fix.patch invalid/octet-stream 603 bytes

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message anderson 2016-12-27 03:30:47 [BUG] pg9.4.10 Logical decoding did not get the correct oldtuplelen
Previous Message Nathanael Terrien 2016-12-26 16:47:29 Re: BUG #14478: Right Trim trims too much?