From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Michael Banck <michael(dot)banck(at)credativ(dot)de> |
Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: gitlab post-mortem: pg_basebackup waiting for checkpoint |
Date: | 2017-02-11 10:07:59 |
Message-ID: | CABUevExpVYuLUgoNgYNNHxFmZqo3PuuaKgcVwYE5B5wCGScZkQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, Feb 11, 2017 at 10:38 AM, Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:
> Hi,
>
> one take-away from the Gitlab Post-Mortem[1] appears to be that after
> their secondary lost replication, they were confused about what
> pg_basebackup was doing when they tried to rebuild it. It just sat there
> and did nothing (even with --verbose), so they assumed something was
> wrong with either the primary or the connection, and restarted it
> several times.
>
> AFAICT, it turns out the checkpoint was written on the master (they
> probably did not use -c fast), but this wasn't obvious to them:
>
Yeah, I've seen this happen to a number of people, and it sounds like
that's what happened here as well. I've considered doing something along
the lines of the patch you posted, but never got around to it.

> ISTM that even with WAL streaming, nothing would be written on the
> client server until the checkpoint is complete, as do_pg_start_backup()
> runs the checkpoint and only returns the starting WAL location
> afterwards.
>
> The attached (untested) patch is to kick off a discussion on how to
> improve the situation, it is supposed to mention the checkpoint when
> --verbose is used and adds a paragraph about the checkpoint being run to
> the Notes section of the documentation.
>
>
The docs look good to me, except that they claim pg_basebackup runs on a
server (it can run anywhere). I would just say "during which pg_basebackup
will appear idle". How does that sound to you?

As for the code, I haven't tested it, but isn't the "checkpoint completed"
message in the wrong place? PQsendQuery() returns as soon as the command
has been sent, without waiting for the server to act on it, so shouldn't
the message be printed *after* the PQgetResult() call?
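
To illustrate the ordering concern, here is a minimal sketch in C. It
deliberately does not use libpq: FakeConn, fake_send_query(),
fake_get_result() and start_backup_verbose() are hypothetical stand-ins
that only model "send returns at once, get-result waits for the
checkpoint". It is not the code from the patch, just a picture of where
the two verbose messages would need to go under that assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a server connection; not a libpq type. */
typedef struct
{
    int  query_pending;    /* 1 once a command has been dispatched */
    int  checkpoint_done;  /* 1 once the simulated checkpoint has finished */
    char log[256];         /* ordered, ';'-separated record of events */
} FakeConn;

/* Append one event to the connection's log. */
static void
log_event(FakeConn *conn, const char *event)
{
    size_t used = strlen(conn->log);

    snprintf(conn->log + used, sizeof(conn->log) - used,
             "%s%s", used ? ";" : "", event);
}

/* Models PQsendQuery(): dispatch the command and return immediately. */
static int
fake_send_query(FakeConn *conn)
{
    conn->query_pending = 1;
    log_event(conn, "sent");
    return 1;
}

/* Models PQgetResult(): only returns once the server has a result,
 * which in this model means after the checkpoint has completed. */
static int
fake_get_result(FakeConn *conn)
{
    assert(conn->query_pending);
    conn->checkpoint_done = 1;      /* the long wait would happen here */
    log_event(conn, "checkpoint");
    conn->query_pending = 0;
    return 1;
}

/* Verbose messages placed around the blocking call, not around the send. */
void
start_backup_verbose(FakeConn *conn)
{
    if (!fake_send_query(conn))
        return;
    log_event(conn, "msg:waiting for checkpoint");

    if (!fake_get_result(conn))
        return;
    log_event(conn, "msg:checkpoint completed");
}
```

Calling start_backup_verbose() on a zero-initialized FakeConn leaves the
events in conn.log in the order they occurred, so the "completed" message
can be seen to follow the simulated checkpoint rather than the send.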
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/