gitlab post-mortem: pg_basebackup waiting for checkpoint

From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: gitlab post-mortem: pg_basebackup waiting for checkpoint
Date: 2017-02-11 09:38:09
Message-ID: 1486805889.24568.96.camel@credativ.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

one take-away from the Gitlab Post-Mortem[1] appears to be that after
their secondary lost replication, they were confused about what
pg_basebackup was doing when they tried to rebuild it. It just sat there
and did nothing (even with --verbose), so they assumed something was
wrong with either the primary or the connection, and restarted it
several times.

AFAICT, it turns out the checkpoint was written on the master (they
probably did not use -c fast), but this wasn't obvious to them:

"One of the engineers went to the secondary and wiped the data
directory, then ran pg_basebackup. Unfortunately pg_basebackup would
hang, producing no meaningful output, despite the --verbose option being
set."

[...]

"Unfortunately this did not resolve the problem of pg_basebackup not
starting replication immediately. One of the engineers decided to run it
with strace to see what it was blocking on. strace showed that
pg_basebackup was hanging in a poll call, but that did not provide any
other meaningful information that might have explained why."

[...]

"It would later be revealed by another engineer (who wasn't around at
the time) that this is normal behavior: pg_basebackup will wait for the
primary to start sending over replication data and it will sit and wait
silently until that time. Unfortunately this was not clearly documented
in our engineering runbooks nor in the official pg_basebackup document."

ISTM that even with WAL streaming, nothing would be written on the
client server until the checkpoint is complete, as do_pg_start_backup()
runs the checkpoint and only returns the starting WAL location
afterwards.

The attached (untested) patch is to kick of a discussion on how to
improve the situation, it is supposed to mention the checkpoint when
--verbose is used and adds a paragraph about the checkpoint being run to
the Notes section of the documentation.

Michael

[1]https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Attachment Content-Type Size
pg_basebackup.patch text/x-patch 1.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2017-02-11 10:07:59 Re: gitlab post-mortem: pg_basebackup waiting for checkpoint
Previous Message Fabien COELHO 2017-02-11 07:43:12 Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)