Re: Enhance traceability of wal_level changes for backup management

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: 'David Steele' <david(at)pgmasters(dot)net>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, 'Masahiko Sawada' <sawada(dot)mshk(at)gmail(dot)com>, "'pgsql-hackers(at)lists(dot)postgresql(dot)org'" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Enhance traceability of wal_level changes for backup management
Date: 2021-03-15 21:32:29
Message-ID: 20210315213229.GA20766@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* tsunakawa(dot)takay(at)fujitsu(dot)com (tsunakawa(dot)takay(at)fujitsu(dot)com) wrote:
> From: David Steele <david(at)pgmasters(dot)net>
> > As a backup software author, I don't see this feature as very useful.
> >
> > The problem is that there are lots of ways for WAL to go missing so
> > monitoring the WAL archive for gaps is essential and this feature would
> > not replace that requirement. The only extra information you'd get is
> > the ability to classify the most recent gap as "intentional", maybe.
>
> But how do you know there's any missing WAL? I think there are the following cases of missing WAL:
>
> 1. A WAL segment file is missing. e.g., 00000001 and 00000003 exist, but 00000002 doesn't.
>
> 2. All consecutive WAL segment files appear to exist, but some WAL records are missing.
> This occurs "only" when some WAL-optimized statements are run while wal_level = minimal.
>
> Currently, backup management tools can detect 1 by scanning through the WAL archive directory. But they can't notice 2. The patch addresses this.

They could notice #2 also by scanning the WAL, but that's certainly a
lot more work than just looking in pg_control.
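
For case #1, the archive scan can be sketched roughly as below. This is
only an illustration, not code from any backup tool: it assumes the
default 16MB segment size (256 segments per log id), a single timeline,
and a hypothetical helper name count_missing_segments().

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: default 16MB segments, i.e. 256 segments per 4GB log id. */
#define SEGMENTS_PER_XLOGID 256u

/* Parse a 24-hex-digit WAL segment file name; returns 0 on success. */
static int
parse_wal_name(const char *name, uint32_t *tli, uint64_t *segno)
{
    unsigned int t, log, seg;

    if (strlen(name) != 24 || strspn(name, "0123456789ABCDEF") != 24)
        return -1;
    if (sscanf(name, "%8X%8X%8X", &t, &log, &seg) != 3)
        return -1;
    if (seg >= SEGMENTS_PER_XLOGID)
        return -1;
    *tli = t;
    *segno = (uint64_t) log * SEGMENTS_PER_XLOGID + seg;
    return 0;
}

static int
cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a;
    uint64_t y = *(const uint64_t *) b;

    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: count segments missing between the lowest and
 * highest name given.  Returns -1 on a malformed name, mixed timelines,
 * or allocation failure.
 */
long
count_missing_segments(const char *const *names, size_t n)
{
    uint64_t *segs;
    uint32_t first_tli = 0;
    long missing = 0;
    size_t i;

    if (n == 0)
        return 0;
    segs = malloc(n * sizeof(uint64_t));
    if (segs == NULL)
        return -1;
    for (i = 0; i < n; i++)
    {
        uint32_t tli;

        if (parse_wal_name(names[i], &tli, &segs[i]) != 0 ||
            (i > 0 && tli != first_tli))
        {
            free(segs);
            return -1;
        }
        if (i == 0)
            first_tli = tli;
    }
    qsort(segs, n, sizeof(uint64_t), cmp_u64);
    for (i = 1; i < n; i++)
    {
        /* Duplicates contribute nothing; a jump of k leaves k-1 missing. */
        if (segs[i] > segs[i - 1] + 1)
            missing += (long) (segs[i] - segs[i - 1] - 1);
    }
    free(segs);
    return missing;
}
```

Note that a scan like this only ever sees whole files, which is exactly
why it cannot catch case #2.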

* Peter Eisentraut (peter(dot)eisentraut(at)enterprisedb(dot)com) wrote:
> On 08.03.21 03:45, osumi(dot)takamichi(at)fujitsu(dot)com wrote:
> > OK. The basic idea is to enable backup management
> > tools to recognize wal_level drop between *snapshots*.
> > When you have a snapshot of the cluster at one time and another one
> > at different time, with this new parameter, you can see if
> > anything that causes discontinuity from the drop happens
> > in the middle of the two snapshots without efforts to have a look at the WALs in between.
>
> Is this an actual problem? Changing wal_level requires a restart. Are
> users frequently restarting their servers to change wal_level and then
> wonder why their backups are misbehaving or incomplete? Why? Just like
> fsync is "breaks your database", wal_level might as well be "breaks your
> backups". Is it not documented well enough?

We explicitly document that people can switch the WAL level and restart
to do bulk data loads faster, and there's certainly no shortage of
discussion (including what prompted this thread..) about doing exactly
that. Adding more documentation around that would certainly be good,
as would changing this:

    ereport(WARNING,
            (errmsg("WAL was generated with wal_level=minimal, data may be missing"),
             errhint("This happens if you temporarily set wal_level=minimal without taking a new base backup.")));

into a PANIC instead of a WARNING. It's simply far too easy to end up
with corruption in the system when doing PITR through a period of time
when the WAL level was set to minimal. Unfortunately, a user who
didn't happen to know that they needed to take a new full backup after
flipping to minimal and back could end up with corruption at
restore/replay time, which is certainly not when you want anything to go
wrong. If the wal_level change were recorded in the control file, then
we could more proactively make noise at the user to take a new full
backup.
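
To make that last point concrete, here is a rough sketch of the check a
tool could run. It is purely an assumption on my part that the control
file would expose the LSN at which wal_level was last minimal (called
last_minimal_lsn here, with 0 meaning "never"); that field, the Lsn
typedef and both function names are hypothetical and not part of any
existing API.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for an LSN */

/* Parse the usual "hi/lo" hexadecimal LSN text form; returns 0 on success. */
int
parse_lsn(const char *text, Lsn *out)
{
    unsigned int hi, lo;
    char trailing;

    /* Exactly two fields must match; a third means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return -1;
    *out = ((Lsn) hi << 32) | lo;
    return 0;
}

/*
 * Returns 1 if WAL was generated with wal_level=minimal after the given
 * base backup started, i.e. PITR from that backup cannot safely replay
 * past that point and a new full backup is needed.
 *
 * Assumption: last_minimal_lsn comes from a hypothetical control file
 * field and is 0 if wal_level has never been minimal.
 */
int
needs_new_base_backup(Lsn last_minimal_lsn, Lsn backup_start_lsn)
{
    if (last_minimal_lsn == 0)
        return 0;
    return last_minimal_lsn > backup_start_lsn;
}
```

A tool would compare the value read from the control file against the
start LSN of its most recent full backup and complain as soon as the
function returns 1, rather than at restore time.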

Thanks,

Stephen
