Re: pg_walcleaner - new tool to detect, archive and delete the unneeded wal files (was Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary)

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>
Subject: Re: pg_walcleaner - new tool to detect, archive and delete the unneeded wal files (was Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary)
Date: 2022-04-18 15:18:00
Message-ID: 20220418151800.GL10577@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greeting,

* Bharath Rupireddy (bharath(dot)rupireddyforpostgres(at)gmail(dot)com) wrote:
> On Mon, Apr 18, 2022 at 7:41 PM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > * Bharath Rupireddy (bharath(dot)rupireddyforpostgres(at)gmail(dot)com) wrote:
> > > Thanks for the comments. Here's a new tool called pg_walcleaner which
> > > basically deletes (optionally archiving before deletion) the unneeded
> > > WAL files.
> > >
> > > Please provide your thoughts and review the patches.
> >
> > Alright, I spent some more time thinking about this and contemplating
> > what the next steps are... and I feel like the next step is basically
> > "add a HINT when the server can't start due to being out of disk space
> > that one should consider running pg_walcleaner" and at that point... why
> > aren't we just, uh, doing that? This is all still quite hand-wavy, but
> > it sure would be nice to be able to avoid downtime due to a broken
> > archiving setup. pgbackrest has a way of doing this and while we, of
> > course, discourage the use of that option, as it means throwing away
> > WAL, it's an option that users have. PG could have a similar option.
> > Basically, to archive_command/library what max_slot_wal_keep_size is for
> > slots.
>
> Thanks. I get your point. The way I see it is that the postgres should
> be self-aware of the about-to-get-full disk (probably when the data
> directory size is 90%(configurable, of course) of total disk size) and
> then freeze the new write operations (may be via new ALTER SYSTEM SET
> READ-ONLY or setting default_transaction_read_only GUC) and then go
> clean the unneeded WAL files by just invoking pg_walcleaner tool
> perhaps. I think, so far, this kind of work has been done outside of
> postgres. Even then, we might get into out-of-disk situations
> depending on how frequently we check the data directory size to
> compute the 90% configurable limit. Detecting the disk size is the KEY
> here. Hence we need an offline invokable tool like pg_walcleaner.

Ugh, last I checked, figuring out if a given filesystem is near being
full is a pain to do in a cross-platform way. Why not just do exactly
what we already are doing for replication slots, but for
archive_command? Then we wouldn't need to go into a read-only mode.
Perhaps going into a read-only mode would be an alternative option to
that but we should definitely be letting the admin pick what to do in
such a case. The idea of going read-only and then *also* removing WAL
files doesn't seem like it's ever the right choice though..?

As for worrying about frequency.. that seems unlikely to be that serious
of an issue, if we just check how far behind we are with each time we
try to run archive_command. That's basically how the pgbackrest option
works and we've had few issues with it not being 'soon enough'.

> Actually, I was planning to write an extension with a background
> worker doing this for us.

... but we have a background worker already for archiving that could
handle this for us, doesn't seem like we need another.

> > That isn't to say that we shouldn't also have a tool like this, but it
> > generally feels like we're taking a reactive approach here rather than a
> > proactive one to addressing the root issue.
>
> Agree. The offline tool like pg_walcleaner can help greatly even with
> some sort of above internal/external disk space monitoring tools.

See, this seems like a really bad idea to me. I'd be very concerned
about people mis-using this tool in some way and automating its usage
strikes me as absolutely exactly that.. Are we sure that we can
guarantee that we don't remove things we shouldn't when this ends up
getting run against a running cluster from someone's automated tooling?
Or when someone sees that it refuses to run for $reason and tries to..
"fix" that? Seems quite risky to me.. I'd probably want to put similar
caveats around using this tool as I do around pg_resetwal when doing
training- that is, don't ever, ever, ever use this, heh.

Thanks,

Stephen

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-04-18 15:26:02 Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)
Previous Message Robert Haas 2022-04-18 15:14:48 Re: TRAP: FailedAssertion("HaveRegisteredOrActiveSnapshot()", File: "toast_internals.c", Line: 670, PID: 19403)