Re: pg_receivexlog creating holes in WAL (HEAD only)

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: pg_receivexlog creating holes in WAL (HEAD only)
Date: 2016-12-27 09:40:39
Message-ID: CABUevExh2p6j=uzfMRUvFdm59dDKNcMmM=3H1rknau8Uwa4dig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Dec 27, 2016 at 4:10 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

> Hi all,
>
> On HEAD, I am seeing the following thing with pg_receivexlog.
> 1) Run it, server is still on segment 1/0/1:
> $ pg_receivexlog -D ~/data/logs/ --verbose
> pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)
> $ psql -X -c 'select pg_current_xlog_location()'
> pg_current_xlog_location
> --------------------------
> 0/1510840
> (1 row)
> 2) Only a partial segment is being written:
> $ ls ~/data/logs/
> 000000010000000000000001.partial
> 3) Kill the PG instance, pg_receivexlog rightfully complains:
> pg_receivexlog: could not receive data from WAL stream: server closed
> the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> pg_receivexlog: disconnected; waiting 5 seconds to try again
> 2016-12-27 11:35:02 JST [42342]: [1-1] db=,user=,app=,client= DEBUG:
> logger shutting down
> pg_receivexlog: could not connect to server: could not connect to
> server: Connection refused
> Is the server running locally and accepting
> connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
> 4) Looking at the logs directory, the .partial segment went away:
> $ ls logs/
> 000000010000000000000001
> 5) Restart the PG instance, pg_receivexlog complains, again:
> pg_receivexlog: starting log streaming at 0/2000000 (timeline 1)
> pg_receivexlog: unexpected termination of replication stream: ERROR:
> requested starting point 0/2000000 is ahead of the WAL flush position
> of this server 0/15108B0
> pg_receivexlog: disconnected; waiting 5 seconds to try again
>
> Oops. Once the server switches to segment 1/0/2, it will then work
> again, but any content from the moment where the server has been
> killed to the moment it has reconnected has been lost. By that I mean
> the end of segment 1/0/1.
>
> The issue here is at step 4). When a connection failure happens,
> pg_receivexlog renames the .partial segment to be a completed segment.
> So at its next loop it considers the segment as sane, but it should be
> considered as corrupted, and it should not use it. On back-branches,
> 9.6 included, the .partial file is kept after a connection failure and
> streaming begins from the beginning of the partial segment, the last
> .partial file being overwritten. In short, this is an oversight from
> 56c7d8d, this bit to be precise:
> @@ -777,10 +679,10 @@ ReceiveXlogStream(PGconn *conn, StreamCtl *stream)
> }
>
> error:
> - if (walfile != -1 && close(walfile) != 0)
> + if (walfile != NULL && stream->walmethod->close(walfile,
> CLOSE_NORMAL) != 0)
> fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
> - progname, current_walfile_name, strerror(errno));
> - walfile = -1;
> + progname, current_walfile_name, stream->walmethod->
> getlasterror());
> + walfile = NULL;
>
> The problem is that when a temporary suffix is defined, CLOSE_NORMAL
> will rename the existing segment (this is not close_walfile()!).
> Attached is a patch to fix the problem, which is to just use
> CLOSE_NO_RENAME instead of CLOSE_NORMAL.
>

Hi!

Yeah, agreed, this seems like a pure bug in the walmethod patch. Probably
caused by copy/paste-replacing that interface way too many times during my
development.

Good spot, thanks! Applied.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message email 2016-12-27 17:23:07 BUG #14479: Casting non-constant values to INTERVAL DAY TO MINUTE doesn't truncate the seconds field
Previous Message Michael Paquier 2016-12-27 05:29:44 Re: [BUG] pg9.4.10 Logical decoding did not get the correct oldtuplelen