Re: Thoughts on "killed tuples" index hint bits support on standby

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Thoughts on "killed tuples" index hint bits support on standby
Date: 2021-01-31 01:39:10
Message-ID: CAH2-Wzn+9+vKmC=fG2ocAeVC0S6U1DxJ5y=z0bfRi6pMEjN-ww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 30, 2021 at 9:11 AM Michail Nikolaev
<michail(dot)nikolaev(at)gmail(dot)com> wrote:
> > Yeah, it would help a lot. But those bits are precious. So it makes
> > sense to think about what to do with both of them in index AMs at the
> > same time. Otherwise we risk missing some important opportunity.
>
> Hm. I was trying to "expand the scope" as you said and got an idea... Why we even should do any conflict resolution for hint bits? Or share precious LP bits?

What does it mean today when an LP_DEAD bit is set on a standby? I
don't think that it means nothing at all -- at least not if you think
about it in the most general terms.

In a way it actually means something like "recently dead" even today,
at least in one narrow specific sense: recovery may end, and then we
actually *will* do *something* with the LP_DEAD bit, without directly
concerning ourselves with when or how each LP_DEAD bit was set.

During recovery, we will probably always have to consider the
possibility that LP_DEAD bits that get set on the primary may be
received by a replica through some implementation detail (e.g. LP_DEAD
bits are set in FPIs we replay, maybe even some other thing that
neither of us have thought of). We can't really mask LP_DEAD bits from
the primary in recovery anyway, because of stuff like page-level
checksums. I suspect that it just isn't useful to fight against that.
These preexisting assumptions are baked into everything already.

Why should we assume that *anybody* understands all of the ways in
which that is true?

Even today, "LP_DEAD actually means a limited kind of 'recently dead'
during recovery + hot standby" is something that is true (as I just
went into), but at the same time also has a fuzzy definition. My gut
instinct is to be conservative about that. As I suggested earlier, you
could invent some distinct kind of "recently dead" that achieves your
goals in a way that is 100% orthogonal.

The two unused LP dead bits (unused in indexes, though not tables) are
only precious if we assume that somebody will eventually use them for
something -- if nobody ever uses them then they're 100% useless. The
number of possible projects that might end up using the two bits for
something is not that high -- certainly not more than 3 or 4. Besides,
it is always a good idea to keep the on-disk format as simple and
explicit as possible -- it should be easy to analyze forensically in
the event of bugs or some other kind of corruption, especially by
using tools like amcheck.

As I said in my earlier email, we can even play tricks during page
deletion by treating certain kinds of "recently dead" bits as
approximate things. As a further example, we can "rely" on the
"dead-to-all but only on standby" bits when recovery ends, during a
subsequent write transactions. We can do so by simply using them in
_bt_deadblocks() as if they were LP_DEAD (we'll recheck heap tuples in
heapam.c instead).

> The only way standby could get an "invalid" hint bit - is an FPI from the primary. We could just use the current *btree_mask* infrastructure and clear all "probably invalid" bits from primary in case of standby while applying FPI (in `XLogReadBufferForRedoExtended`)!

I don't like that idea. Apart from what I said already, you're
assuming that setting LP_DEAD bits in indexes on the primary won't
eventually have some value on the standby after it is promoted and can
accept writes -- they really do have meaning and value on standbys.
Plus you'll introduce new overhead for this process during replay,
which creates significant overhead -- often most leaf pages have some
LP_DEAD bits set during recovery.

> For binary compatibility, we could use one of `btpo_flags` bits to mark the page as "primary bits masked". The same way would work for hash\gist too.

I agree that there are plenty of btpo_flags bits. However, I have my
doubts about this.

Why shouldn't this break page-level checksums (or wal_log_hints) in
some way? What about pg_rewind, some eventual implementation of
incremental backups, etc? I suspect that it will be necessary to
invent some entirely new concept that is like a hint bit, but works on
standbys (without breaking anything else).

> No conflicts, only LP_DEAD bit is used by standby, `btpo_flags` has many free bits for now, easy to implement, page content of primary\standby already differs in this bits...
> Looks like an easy and effective solution for me.

Note that the BTP_HAS_GARBAGE flag (which goes in btpo_flags) was
deprecated in commit cf2acaf4. It was never a reliable indicator of
whether or not some LP_DEAD bits are set in the page. And I think that
it never could be made to work like that.

> >> Also, btw, do you know any reason to keep minRecoveryPoint at a low value?
> > Not offhand.
>
> If so, looks like it is not a bad idea to move minRecoveryPoint forward from time to time (for more aggressive standby index hint bits). For each `xl_running_xacts` (about each 15s), for example.

It's necessary for recoverypoints (i.e. the standby equivalent of a
checkpoint) to do that in order to ensure that we won't break
checksums. This whole area is scary to me:

https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com

Since, as I said, it's already true that LP_DEAD bits on standbys are
some particular kind of "recently dead bit", even today, any design
that uses LP_DEAD bits in some novel new way (also on the standby) is
very hard to test. It might easily have very subtle bugs -- obviously
a recently dead bit relates to a tuple pointing to a logical row that
is bound to become dead soon enough. The difference between dead and
recently dead is already blurred, and I would rather not go there.

If you invent some entirely new category of standby-only hint bit at a
level below the access method code, then you can use it inside access
method code such as nbtree. Maybe you don't have to play games with
minRecoveryPoint in code like the "if (RecoveryInProgress())" path
from the XLogNeedsFlush() function. Maybe you can do some kind of
rudimentary "masking" for the in recovery case at the point that we
*write out* a buffer (*not* at the point hint bits are initially set)
-- maybe this could happen near to or inside FlushBuffer(), and maybe
only when checksums are enabled? I'm unsure.

> > BTW, what happens when the page splits on the primary, btw? Does your
> > patch "move over" the LP_DEAD bits to each half of the split?
>
> That part is not changed in any way.

Maybe it's okay to assume that it's no loss to throw away hint bits
set on the standby, because in practice deletion on the primary will
usually do the right thing anyway. But you never said that. I think
that you should take an explicit position on this question -- make it
a formal part of your overall design.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-01-31 05:07:01 Re: shared tempfile was not removed on statement_timeout
Previous Message Thomas Munro 2021-01-31 01:26:31 Re: shared tempfile was not removed on statement_timeout