Re: pg_walcleaner - new tool to detect, archive and delete the unneeded wal files (was Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary)

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>
Subject: Re: pg_walcleaner - new tool to detect, archive and delete the unneeded wal files (was Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary)
Date: 2022-05-03 21:17:12
Message-ID: 20220503211712.GE15249@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

* Bharath Rupireddy (bharath(dot)rupireddyforpostgres(at)gmail(dot)com) wrote:
> On Tue, Apr 26, 2022 at 12:09 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > I was thinking more specifically along the lines of "if there's > X GB
> > of WAL that hasn't been archived, give up on archiving anything new"
> > (which is how the pgbackrest option works).
>
> IMO, this option is dangerous because the users might lose the
> unarchived WAL files completely. Instead, fixing the archive_command
> either before the server runs out of disk space (if someone/some
> monitoring detected the archive_command failure) or after the server
> crashes with no space.

Certainly, fixing the archive_command would be the best answer. We're
here because we know that doesn't always happen and are looking for
other ways to avoid the server crashing or to at least help clean things
up if a crash does happen.

> > As archiving with this command is optional, it does present the same
> > risk too. Perhaps if we flipped it around to require the
> > archive_command be provided then it'd be a bit better, though we would
> > also need a way for users to ask for us to just delete the WAL without
> > archiving it. There again though ... the server already has a way of
> > both archiving and removing archived WAL and also has now grown the
> > archive_library option, something that this tool would be pretty hard to
> > replicate, I feel like, as it wouldn't be loading the library into a PG
> > backend anymore. As we don't have any real archive libraries yet, it's
> > hard to say if that's going to be an actual issue or not. Something to
> > consider though.
>
> Actually, why are we just thinking that the WAL files grow up only
> because of archive command failures (of course, it's the main cause
> even in a well-configured server)?. Imagine if the checkpoints weren't
> happening frequently for whatever reasons or the
> max_slot_wal_keep_size or wal_keep_size weren't set appropriately
> (even if they set properly, someone, could be an attacker, can reset
> them), high write activity generating huge WAL files (especially on
> smaller servers) - these too can be reasons for server going down
> because of WAL files growth.

The option that I outline would probably be the "end" of all of the
above- that is, its job is to absolutely keep the pg_wal filesystem from
running out of space by making sure that pg_wal doesn't grow beyond the
configured size. That might mean that we don't keep wal_keep_size
amount of WAL or maybe get rid of WAL before max_slot_wal_keep_size. If
checkpoints aren't happening, that's different and throwing away WAL
would mean corruption on the primary and so we'd just have to accept a
crash in that case but I don't think that's anywhere near as common as
the other cases. We would probably want to throw warnings if someone
configured wal_keep_size bigger than throw_away_unneeded_wal or
whatever.

> The main motto of this tool is to use (by DBAs or service engineers)
> it as a last resort option after the server goes out of disk space to
> quickly bring the server back online so that the regular cleaning up
> activity can take place or the storage scale operations can be
> performed if required. There's a dry run option in pg_walcleaner to
> see if it can free up some space at all. If required we can provide an
> option to archive and delete only the maximum of specified number of
> WAL files.

Unfortunately, it's not likely to be very quick if archive command is
set to anything real. Interesting idea to have an option along the
lines of "please archive X amount of WAL", but there again- the server
could, in theory anyway, start up and get a few WAL segments archived
and then start doing replay from the crash, and if it ran out of space
and archiving was still happening, it could perhaps just wait and try
again. With the ALTER SYSTEM READ ONLY, it could maybe even just go
read-only if it ran out of space in the first place and continue to try
and archive WAL. Kind of hard to say how much additional space is
needed to get through crash recovery and so this suggested option would
always end up having to be a pretty broad guess and you absolutely
couldn't start the server while this command is running (which I would
definitely be worried would end up happening...), so that's also
something to think about.

> I'm okay with making the archive_command mandatory and if archiving
> isn't required to be used with the pg_walcleaner tool they can set
> some dummy values such as archive_command='/bin/true'.

Yeah, that seems at least a bit better. I imagine a lot of people will
probably use it exactly like that, but this will hopefully at least
discourage them from just throwing it away.

> > > The initial version of the patch doesn't check if the server crashed
> > > or not before running it. I was thinking of looking at the
> > > postmaster.pid or pg_control file (however they don't guarantee
> > > whether the server is up or crashed because the server can crash
> > > without deleting postmaster.pid or updating pg_control file). Another
> > > idea is to let pg_walcleaner fire a sample query ('SELECT 1') to see
> > > if the server is up and running, if yes, exit, otherwise proceed with
> > > its work.
> >
> > All of which isn't an issue if we don't have an external tool trying to
> > do this and instead have the server doing it as the server knows its
> > internal status, that the archive command has been failing long enough
> > to pass the configuration threshold, and that the WAL isn't needed for
> > crash recovery. The ability to avoid having to crash and go through
> > that process is pretty valuable. Still, a crash may still happen and
> > it'd be useful to have a clean way to deal with it. I'm not really a
> > fan of having to essentially configure this external command as well as
> > have the server configured. Have we settled that there's no way to make
> > the server archive while there's no space available and before trying to
> > write out more data?
>
> The pg_walcleaner tool isn't intrusive in the sense that it doesn't
> delete the WAL files that are required for the server to come up (as
> it checks for the checkpoint redo WAL file), apart from this it has
> archive_command too so no loss of the WAL file(s) at all unlike the
> pgbackrest option.

Won't be any WAL loss with pgbackrest unless it's specifically
configured to throw it away- again, it's a tradeoff. Just suggesting
that we could have that be part of core as an option.

> > > Also, to not cause losing of WAL permanently, we must recommend using
> > > archvie_command so that the WAL can be moved to an alternative
> > > location (could be the same archvie_location that primary uses).
> >
> > I agree we should recommend using archive_command or archive_library, of
> > course, but if that's been done and is working properly then this tool
> > isn't really needed. The use-case we're trying to address, I believe,
> > is something like:
> >
> > 1) archive command starts failing for some reason
> > 2) WAL piles up on the primary
> > 3) primary runs out of disk space, crash happens
> > 4) archive command gets 'fixed' in some fashion
> > 5) WAL is archived and removed from primary
> > 6) primary is restarted and able to go through crash recovery
> > 7) server is online again
> >
> > Now, I was trying to suggest an approach to addressing the issue at #2,
> > that is, avoid having WAL pile up without end on the primary and avoid
> > the crash in the first place. For users who care more about uptime and
> > less about WAL, that's likely what they want.
> >
> > For users who care more about WAL than uptime, it'd be good to have a
> > way to help them too, but to do that, #4 has to happen and, once that's
> > done, #5 and following just need to be accomplished in whatever way is
> > simplest. The thought I'm having here is that the simplest answer, at
> > least from the user's perspective, is that the server is able to just be
> > brought up with the fixed archive command and everything just works-
> > archiving happens, space is free'd up, and the server comes up and
> > continues running.
> >
> > I accept that it isn't this patch's job or mandate to go implement some
> > new option that I've thought up, and I could be convinced that this
> > separate tool is just how we're going to have to get #5 accomplished for
> > now due to the complexity of making the server do the archiving early on
> > and/or that it has other downsides (if the crash wasn't due to running
> > out of space, making the server wait to come online until after the WAL
> > has been archived wouldn't be ideal) that make it a poor choice overall,
> > but it seems like it's something that's at least worth some thought and
> > consideration of if there's a way to accomplish this with a bit less
> > manual user involvement, as that tends to be error-prone.
>
> As I said above, let's not just think that the archive_command
> failures are the only reasons (of course they are the main causes) for
> WAL file growth.

I agree that we should consider other reasons and how the suggested
option might interact with the other existing options we have. This
doesn't really address the point that I was making though.

> The pg_walcleaner proposed here is a better option IMO for the
> developers/DBAs/service engineers to see if they can quickly bring up
> the crashed server avoiding some manual work of looking at which WAL
> files can be deleted as such. Also if they fixed the failed
> archive_command or wrong settings that caused the WAL files growth,
> the server will be able to do the other cleanup and come online
> cleanly. I'm sure many would have faced this problem of server crashes
> due to out of disk space at some point in time in production
> environments, especially with the older versions of postgres.

For some folks, it'd be better to avoid having a crash at all, but of
course there's downsides to that too.

For others, it strikes me as clearly simpler if they could just fix the
archive_command and start PG again rather than having to run some other
command while making sure that PG isn't running (and doesn't end up
getting started), and then PG could go do the archiving and, ideally,
also be doing WAL replay as soon as there's enough space to allow it,
and possibly even come back online once crash recovery is done and
continue to archive WAL just like normal operation. I didn't really see
that idea addressed in your response and was hoping to get your thoughts
on it as a possible alternative approach to having a separate tool such
as this.

Thanks,

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zheng Li 2022-05-03 21:30:20 Re: Support logical replication of DDLs
Previous Message Thomas Munro 2022-05-03 20:53:32 Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: