Quick Links

Re: Online checksums patch - once again

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Daniel Gustafsson <daniel(at)yesql(dot)se>, Magnus Hagander <magnus(at)hagander(dot)net>, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Online checksums patch - once again
Date:	2020-01-26 21:03:03
Message-ID:	20200126210303.2plp4fivproetsqm@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2020-01-23 12:23:09 -0500, Robert Haas wrote:
> On Thu, Jan 23, 2020 at 6:19 AM Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
> > A bigger question is how to handle the offline capabilities. pg_checksums can
> > enable or disable checksums in an offline cluster, which will put the cluster
> > in a state where the pg_control file and the catalog don't match at startup.
> > One strategy could be to always trust the pg_control file and alter the catalog
> > accordingly, but that still leaves a window of inconsistent cluster state.
>
> I suggest that we define things so that the catalog state is only
> meaningful during a state transition. That is, suppose the cluster
> state is either "on", "enabling", or "off". When it's "on", checksums
> are written and verified. When it is "off", checksums are not written
> and not verified. When it's "enabling", checksums are written but not
> verified. Also, when and only when the state is "enabling", the
> background workers that try to rewrite relations to add checksums run,
> and those workers look at the catalog state to figure out what to do.
> Once the state changes to "on", those workers don't run any more, and
> so the catalog state does not make any difference.
>
> A tricky problem is to handling the case where the state is switched
> from "enabling" to "on" and then back to "off" and then to "enabling"
> again. You don't want to confuse the state from the previous round of
> enabling with the state for the current round of enabling. Suppose in
> addition to storing the cluster-wide state of on/off/enabling, we also
> store an "enable counter" which is incremented every time the state
> goes from "off" to "enabling". Then, for each database and relation,
> we store a counter that indicates the value of the enable counter at
> the time we last scanned/rewrote that relation to set checksums. Now,
> you're covered. And, to save space, it can probably be a 32-bit
> counter, since 4 billion disable/reenable cycles ought to be enough
> for anybody.
>
> It would not be strictly necessary to store this in pg_class. Another
> thing that could be done is to store it in a separate system table
> that could even be truncated when enabling is not in progress - though
> it would be unwise to assume that it's always truncated at the
> beginning of an enabling cycle, since it would be hard to guarantee
> that the previous enabling cycle didn't fail when trying to truncate.
> So you'd probably still end up with something like the counter
> approach. I am inclined to think that inventing a whole new catalog
> for this is over-engineering, but someone might think differently.
> Note that creating a table while enabling is in progress needs to set
> the enabling counter for the new table to the new value of the
> enabling counter, not the old one, because the new table starts empty
> and won't end up with any pages that don't have valid checksums.
> Similarly, TRUNCATE, CLUSTER, VACUUM FULL, and rewriting variants of
> ALTER TABLE can set the new value for the enabling counter as a side
> effect. That's probably easier and more efficient if it's just value
> in pg_class than if they have to go poking around in another catalog.
> So I am tentatively inclined to think that just putting it in pg_class
> makes more sense.

I'm somewhat inclined to think that it's worth first making this robust
without catalog state - even though I find restartability
important. Especially due to not having convenient ways to have cross
database state that we can reset without again needing background
workers. I also wonder if it's not worthwhile to design the feature in a
way that, *in the future*, checksums could be separately set on the
standby/primary - without needing to ship the whole database through
WAL.

Oh, if all relation types had a metapage with a common header, this
would be so much easier...

It'd also be a lot easier if we could map from relfilenode back to a
relation oid, without needing catalog access. That'd allow us to acquire
locks on the relation for a filenode, without needing to be connected to
a database. Again, with a metapage, that'd be quite doable.

Probably not worth doing just for this, but I'm wondering about solving
the metapage issue by just adding a metadata relation fork. Sucks to
increase the number of files further, but I don't really see a path
towards having a metapage for everything, given pg_upgrade compat
requirements. Such a metadata fork, in contrast, could easily be filled
by pg_upgrade. That metadata file could perhaps replace init forks too.

For debuggability storing some information about the relation in that
metadata fork would be great. Being able to identify the type of
relation etc from there, and perhaps even the relname at creation, would
certainly be helpful for cases the database doesn't start up anymore.

With a bit of care, we could allow AMs to store additional information
in there, by having a offset pointer for am information in the common
header.

E.g. for tables it'd be feasible to have the types of columns in there
(since it's associated with a relfilenode, rather than relation, there's
no problem with rewrites), allowing to correctly interpret data without
catalog access when shit has hit the fan.

Greetings,

Andres Freund

In response to

Re: Online checksums patch - once again at 2020-01-23 17:23:09 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2020-01-26 21:24:46	Re: EXPLAIN's handling of output-a-field-or-not decisions
Previous Message	Andres Freund	2020-01-26 20:30:59	Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?