From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, berge(at)trivini(dot)no, Gürkan Gür <ben(at)gurkan(dot)in>, Raimund Schlichtiger <raimund(dot)schlichtiger(at)innogames(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Bernhard Schrader <bernhard(dot)schrader(at)innogames(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>
Subject: Re: Standby trying "restore_command" before local WAL
Date: 2018-07-31 15:56:45
Message-ID: 20180731155645.GC27724@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Emre Hasegeli (emre(at)hasegeli(dot)com) wrote:
> This issue came to our attention after we migrated an application from
> an object storage backend, and noticed that restarting a standby node
> takes hours or sometimes days.
>
> We are using a shared WAL archive and find it practical to have
> "restore_command" configured in case we leave a standby offline
> for a long time. However, during the short window while the standby is
> restarted, the master manages to archive a segment. Then,
> the standby executes "restore_command" successfully, and continues
> downloading WAL from the archive, causing the recovery to take orders
> of magnitude longer.

Why is it taking so long for the restore command to return the WAL
file? Is the issue that the primary is generating enough WAL that
whatever you're using for the restore command isn't able to feed PG the
WAL fast enough? We ran into exactly that problem with users of
pgbackrest and took a different approach: we parallelized pulling down
the WAL files into a local queue to be able to feed PG's replay as fast
as possible. We've also moved to C for the parts of the archive-get
and archive-push commands that are run out of archive_command and
restore_command to make them as fast as possible (and are working to
finish rewriting the rest into C as well).
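
To make the idea concrete, here is a minimal sketch (in Python, purely
illustrative and not pgbackrest's actual implementation) of a prefetching
restore_command wrapper: each invocation fetches the requested segment plus
the next few segments from the archive in parallel into a local spool, so
most subsequent requests are served locally. The archive_fetch command, the
spool path, and the default 16MB wal_segment_size are assumptions.

#!/usr/bin/env python3
# Rough sketch only.  Illustrates "parallel prefetch into a local queue":
# each restore_command call fetches the requested segment plus the next few
# segments in parallel, so later calls are usually served from the spool.
#
# Assumed configuration (placeholders, adjust for your setup):
#   restore_command = '/usr/local/bin/wal_prefetch.py %f %p'
import os
import shutil
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

SPOOL = "/var/spool/wal_prefetch"              # local queue of fetched segments
FETCH_CMD = ["/usr/local/bin/archive_fetch"]   # hypothetical "<name> <dest>" fetcher
PREFETCH_COUNT = 8                             # segments to pull ahead
WORKERS = 4                                    # parallel fetches

def next_segments(name, count):
    # WAL file names following 'name' (TTTTTTTTXXXXXXXXYYYYYYYY), assuming the
    # default 16MB segment size (256 segments per "log" unit).
    tli = name[:8]
    log, seg = int(name[8:16], 16), int(name[16:24], 16)
    segno = log * 256 + seg
    return ["%s%08X%08X" % (tli, (segno + i) // 256, (segno + i) % 256)
            for i in range(1, count + 1)]

def fetch(name):
    dest = os.path.join(SPOOL, name)
    if not os.path.exists(dest):
        subprocess.run(FETCH_CMD + [name, dest], check=False)

def main():
    wal_name, target_path = sys.argv[1], sys.argv[2]   # %f and %p
    os.makedirs(SPOOL, exist_ok=True)
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        if len(wal_name) == 24:
            # Plain segment: prefetch the following segments in parallel.
            pool.map(fetch, next_segments(wal_name, PREFETCH_COUNT))
        fetch(wal_name)        # the segment PostgreSQL actually asked for
    spooled = os.path.join(SPOOL, wal_name)
    if not os.path.exists(spooled):
        sys.exit(1)            # non-zero exit: tell PostgreSQL it's unavailable
    shutil.copy(spooled, target_path)
    os.remove(spooled)

if __name__ == "__main__":
    main()

A full implementation would keep the fetcher running asynchronously between
invocations rather than blocking each call on a batch, but even this blocking
version amortizes the archive round-trip latency across the prefetch depth.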

> == The Workarounds ==
>
> We can possibly work around this inside the "restore_command" or
> by delaying the archiving. Working around it inside the "restore_command"
> would involve checking whether the file exists under pg_wal/. This
> would not be easy, because the WAL file may be partially written. It
> would be easier for Postgres to do this, as it knows where to stop
> processing the local WAL.

Yeah, handing PG a partially written WAL file from the existing pg_wal
directory sounds like a rather bad idea.

> It should also be possible to work around this problem by delaying
> archiving using "wal_keep_segments", or replication slots, or simply
> with sleep(). Though, none of those is the correct solution to
> the problem. We don't need the master to keep more segments for
> the standbys. We already have more than enough.

Right, you certainly don't want to hope that wal_keep_segments is
enough or to keep extra WAL on the primary. You could use a replication
slot to avoid the wal_keep_segments risk, but then you're still
potentially going to have a lot of unnecessary WAL on the primary if the
replica has been offline.
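
For completeness, the slot workaround mentioned above is essentially a
one-liner on the primary; a minimal sketch, assuming psycopg2 and a
hypothetical slot name:

import psycopg2

# Create a physical replication slot so the primary retains WAL for the
# standby without guessing wal_keep_segments (at the cost of keeping
# everything the standby has not yet received).
conn = psycopg2.connect("host=primary dbname=postgres")
conn.autocommit = True          # run outside an explicit transaction
cur = conn.cursor()
cur.execute("SELECT pg_create_physical_replication_slot(%s)", ("standby1_slot",))
cur.close()
conn.close()

The standby then points at it with primary_slot_name = 'standby1_slot' in
recovery.conf.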

> == The Change ==
>
> This "restore_command" behavior is coming from the initial archiving
> and point-in-time-recovery implementation [2]. The code says
> "the reason is that the file in XLOGDIR could be an old, un-filled or
> partly-filled version that was copied and restored as part of
> backing up $PGDATA." This was probably a good reason in 2004, but
> I don't think it still is. AFAIK "pg_basebackup" eliminates this
> problem. Also, with this reasoning, we should also try streaming from
> the master before trying the local WAL, but AFAIU we don't.

Just because pg_basebackup wouldn't necessarily create that risk doesn't
mean that the risk has gone away; pg_basebackup isn't the only way for a
replica to be brought up.

> If there is a consensus on fixing this, I can try to prepare
> a patch. The company I am currently working for is also interested
> in sponsoring a support company to fix this problem.

As mentioned by others, it sounds like we could have an option to try
contacting the primary before running restore_command (or, more
generally, an option which allows a user to specify their preference
regarding what PG should try to do first), but I'm trying to figure out
how going to the primary is going to solve your problem: if the replica
is offline for long enough, then the WAL isn't going to be available on
the primary, as you outlined above, and the WAL won't be local either
(since the replica wasn't online while the primary was creating WAL),
so the only option would be to go to the restore_command.

As such, it seems like a faster restore_command would end up being a
better and more general solution.

That said, I don't have any particular issue with a patch to allow the
user to tell PG to try asking the primary first. Of course, that
wouldn't be available until v12.

Thanks!

Stephen
