Re: gitlab post-mortem: pg_basebackup waiting for checkpoint

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gitlab post-mortem: pg_basebackup waiting for checkpoint
Date: 2017-02-11 10:07:59
Message-ID: CABUevExpVYuLUgoNgYNNHxFmZqo3PuuaKgcVwYE5B5wCGScZkQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 11, 2017 at 10:38 AM, Michael Banck <michael(dot)banck(at)credativ(dot)de>
wrote:

> Hi,
>
> one take-away from the Gitlab Post-Mortem[1] appears to be that after
> their secondary lost replication, they were confused about what
> pg_basebackup was doing when they tried to rebuild it. It just sat there
> and did nothing (even with --verbose), so they assumed something was
> wrong with either the primary or the connection, and restarted it
> several times.
>
> AFAICT, it turns out the checkpoint was written on the master (they
> probably did not use -c fast), but this wasn't obvious to them:
>

Yeah, I've seen this happen to a number of people, and I think that is
what happened here as well. I've considered doing something along the lines
of the patch you posted, but never got around to actually doing it.

> ISTM that even with WAL streaming, nothing would be written on the
> client server until the checkpoint is complete, as do_pg_start_backup()
> runs the checkpoint and only returns the starting WAL location
> afterwards.
>
> The attached (untested) patch is to kick off a discussion on how to
> improve the situation. It is supposed to mention the checkpoint when
> --verbose is used, and it adds a paragraph about the checkpoint being run
> to the Notes section of the documentation.
>
>
Docs look good to me, other than claiming that pg_basebackup runs on a
server (it can run anywhere). I would just say "during which pg_basebackup
will appear idle". How does that sound to you?

As for the code, while I haven't tested it, isn't the "checkpoint
completed" message in the wrong place? PQsendQuery() returns immediately,
so doesn't the message need to be put *after* the PQgetResult() call?
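
The ordering concern above can be illustrated with a minimal sketch. It does
not use libpq and is not taken from the patch: FakeConn, Trace,
fake_send_query(), fake_get_result() and start_backup() are hypothetical
stand-ins, and the message strings are invented. The only assumption carried
over from the discussion is that the send call returns immediately while the
result call is where the client actually waits for the checkpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a connection; NOT the real libpq PGconn. */
typedef struct {
    bool query_in_flight;   /* request sent, result not yet collected */
    bool checkpoint_done;   /* server has finished its checkpoint */
} FakeConn;

/* Small in-memory event log so the order of steps can be inspected. */
#define MAX_EVENTS 8
typedef struct {
    const char *events[MAX_EVENTS];
    int count;
} Trace;

static void trace_add(Trace *t, const char *msg)
{
    assert(t->count < MAX_EVENTS);
    t->events[t->count++] = msg;
}

/* Plays the role of PQsendQuery(): queue the request, return at once. */
static bool fake_send_query(FakeConn *conn, Trace *t)
{
    conn->query_in_flight = true;
    trace_add(t, "query sent");
    return true;
}

/* Plays the role of PQgetResult(): returns only once the server is done. */
static bool fake_get_result(FakeConn *conn, Trace *t)
{
    if (!conn->query_in_flight)
        return false;
    conn->checkpoint_done = true;   /* the real wait would happen here */
    conn->query_in_flight = false;
    trace_add(t, "server finished checkpoint");
    return true;
}

/* Announce the wait before sending; report completion only after the result. */
static bool start_backup(FakeConn *conn, Trace *t, bool verbose)
{
    if (verbose)
        trace_add(t, "client: waiting for checkpoint");
    if (!fake_send_query(conn, t))
        return false;
    /* Reporting "completed" here would be premature: nothing has finished. */
    if (!fake_get_result(conn, t))
        return false;
    if (verbose)
        trace_add(t, "client: checkpoint completed");
    return true;
}
```

Calling start_backup() with verbose set records the "completed" event only
after the simulated server has finished, which is the placement suggested
above for the real message.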

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
