Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: chenhj <chjischj(at)163(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files
Date: 2018-01-26 02:36:09
Message-ID: 20180126023609.GH17847@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 25, 2018 at 07:38:37AM -0500, Stephen Frost wrote:
> Michael, all,
>
> * Michael Paquier (michael(dot)paquier(at)gmail(dot)com) wrote:
>> On Wed, Jan 24, 2018 at 12:43:51PM -0500, Stephen Frost wrote:
>>> * chenhj (chjischj(at)163(dot)com) wrote:
>>>> At 2018-01-23 09:56:48, "Stephen Frost" <sfrost(at)snowman(dot)net> wrote:
>>>>> I've only read through the thread to try and understand what's going on
>>>>> and the first thing that comes to mind is that you're changing
>>>>> pg_rewind to not remove the WAL from before the divergence (split)
>>>>> point, but I'm not sure why. As noted, that WAL isn't needed for
>>>>> anything (it's from before the split, after all), so why keep it? Is
>>>>> there something in this optimization that depends on the old WAL being
>>>>> there and, if so, what and why?
>>>>
>>>> After run pg_rewind, the first startup of postgres will do crash recovery.
>>>> And crash recovery will begin from the previous redo point preceding the divergence.
>>>> So, the WAL after the redo point and before the divergence is needed.
>>>
>>> Right.
>>
>> Most of the time, and particularly since v11 has removed the need to
>> retain more past segments than one completed checkpoint, those segments
>> have less chances to be on the source server, limiting more the impact
>> of the patch discussed on this thread.
>
> Good point.

Actually, that is worse that than, because after doing a promotion one
has to issue a checkpoint on the promoted server so as to update its
control file (pg_rewind fetches the new TLI number from the on-disk
file), so we can never ensure that needed WAL segments will be on the
primary. Hence the proposed patch loses most of its value, because there
will always be a WAL segment hole anyway.

It can also help in reducing the load in creating new segments if
min_wal_size is large enough on the rewound server but that is a narrow
benefit.

In short, it seems really to me that we should reject the approach as
proposed, and replace it with something that prevents the fetching of
any WAL segments from the source server. I think that we could consider
as well removing all WAL segments on the primary from the point WAL
forked, as those created between the last checkpoint before WAL forked
up to the point where WAL forked are useful anyway. But those are bonus
points to keep a minimalistic amount of space for the rewound node as
they finish by being recycled anyway. For those reasons I think that the
patch should be marked as returned with feedback.

>> Yes, superuser is necessary now, if we could get to a point where only a
>> replication permission is needed that would be nice. Now we could do
>> things differently. We could have a system role dedicated to pg_rewind
>> which works only on the functions from genfile.c that pg_rewind needs,
>> in order to leverage the need of a superuser.
>
> Actually, the other work I'm doing nearby wrt removing the explicit
> superuser() checks in those functions would allow a non-superuser role
> to be created which could be used by pg_rewind. We could even add an
> explicit 'pg_rewind' default role if we wanted to, but is that the route
> we want to go with this or should we be thinking about changing
> pg_rewind to use the replication protocol and a replication user
> instead..? The only reason that it doesn't today is that there isn't an
> easy way for it to do so with the existing replication protocol, as I
> understand it.

Yes, actually with the piece of work we are doing, we don't need to work
more on the replication protocol, and pg_rewind does not require heavy
refactoring either, so from a maintenance point of view we would gain
much more in the long term by using a dedicated role. If your patch to
have the file-read catalog user gets in, we should add in the docs of
pg_rewind a small set of ACL commands to allow pg_rewind to work with a
non-superuser role as long is it gains access to the minimal set of
functions necessary.

> Then again, there's this whole question of if we should even keep the
> replication protocol or if we should be getting rid of it in favor of
> have regular connections that can support replication. There's been
> discussion of that recently, as I recall, though I can't remember
> where.

Hm. I don't recall this one..

> Having the rewound server decide by itself what it needs actually sounds
> like it would be better and would allow whatever the restore command is
> to fetch any missing WAL necessary (which it might be able to do more
> efficiently, from an archive server and possibly even in parallel as
> compared to pg_rewind doing it serially from the primary...). We don't
> have any particularly easy way to prevent needed WAL from disappearing
> off of the primary anyway, so it seems like it would be necessary to
> have a working restore command for pg_rewind to be reliable. While we
> could create a physical replication slot for pg_rewind when it starts,
> that may already be too late.

The only approach I can think of that makes unnecessary an external
archive is to have a slot that guarantees that WAL segments down to the
previous checkpoint before promotion are kept around to allow a rewound
node to use them. For planned failovers, things can be controlled even
today as you know when to create a slot. Unplanned failovers are where
the problems come. One way I think can address any problems is to have
what I would call a failover slot. The concept is simple: when a node is
in recovery have it create a new physical replication slot at each
restart point which has a user-defined name, controlled by a GUC, and do
the creation automatically. If the parameter is empty, no slots are
created. At the beginning of a restart point, the slot is removed, so
once the node is promoted the slot will not get removed, and it would
keep around WAL segments down to the last redo point before promotion
automatically.

When you don't want to set up a WAL archive, but still want to ensure
that a rewound node is able to catch up with its promoted primary, just
enable this failover slot and then set up primary_slot_name in
recovery.conf to use it.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2018-01-26 02:41:25 Re: [HACKERS] Subscription code improvements
Previous Message Stephen Frost 2018-01-26 02:28:13 Re: [HACKERS] REINDEX CONCURRENTLY 2.0