Re: Thoughts on "killed tuples" index hint bits support on standby

From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Thoughts on "killed tuples" index hint bits support on standby
Date: 2021-02-01 21:19:03
Message-ID: CANtu0ogCZveeVXmJWRO+_eWzrXX42z1LuykwgpZTCqv-5jErew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, Peter.

Thanks a lot for your comments.
There are some mine thoughts, related to the “masked bits” solution and
your comments:

> During recovery, we will probably always have to consider the
> possibility that LP_DEAD bits that get set on the primary may be
> received by a replica through some implementation detail (e.g. LP_DEAD
> bits are set in FPIs we replay, maybe even some other thing that
> neither of us have thought of).

It is fine to receive a page to the standby from any source: `btpo_flags`
should have some kind “LP_DEAD safe for standby” bit set to allow new bits
to be set and old - read.

> We can't really mask LP_DEAD bits from
> the primary in recovery anyway, because of stuff like page-level
> checksums. I suspect that it just isn't useful to fight against that.

As far as I can see - there is no problem here. Checksums already differ
for both heap and index pages on standby and primary. Checksums are
calculated before the page is written to the disk (not after applying FPI).
So, the masking page during *applying* the FPI is semantically the same as
setting a bit in it 1 nanosecond after.

And `btree_mask` (and other mask functions) already used for consistency
checks to exclude LP_DEAD.

> Plus you'll introduce new overhead for this process during replay,
> which creates significant overhead -- often most leaf pages have some
> LP_DEAD bits set during recovery.

I hope it is not big enough, because FPIs are not too frequent + positive
effect will easily overcome additional CPU cycles of `btree_mask` (and the
page is already in CPU cache at the moment).

> I don't like that idea. Apart from what I said already, you're
> assuming that setting LP_DEAD bits in indexes on the primary won't
> eventually have some value on the standby after it is promoted and can
> accept writes -- they really do have meaning and value on standbys.

I think it is fine to lose part of LP_DEAD on promotion (which can be
received only by FPI in practice). They will be set on the first scan
anyway. Also, bits set by standby may be used by newly promoted primary if
we honor OldestXmin of the previous primary while setting it (just need to
add OldestXmin in xl_running_xacts and include it into dead-horizon on
standby).

> Why shouldn't this break page-level checksums (or wal_log_hints) in
> some way? What about pg_rewind, some eventual implementation of
> incremental backups, etc? I suspect that it will be necessary to
> invent some entirely new concept that is like a hint bit, but works on
> standbys (without breaking anything else).

As I said before - applying the mask on *standby* will not break any
checksums - because the page is still dirty after that (and it is even
possible to call `PageSetChecksumInplace` for an additional paranoia).
Actual checksums on standby and primary already have different values (and,
probably, in the most of the pages because LP_DEAD and “classic” hint bits).

> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.

Not sure I was able to understand your idea here, sorry.

> The difference may be subtle, but it's important here -- it justifies
> inventing a whole new type of LP_DEAD-style status bit that gets set
> only on standbys. Even today, heap tuples can have hint bits
> "independently" set on standbys, subject to certain limitations needed
> to avoid breaking things like data page checksums

Yes, and I see three major ways to implement it in the current
infrastructure:

1) Use LP_REDIRECT (or other free value) instead of LP_DEAD on standby
2) Use LP_DEAD on standby, but involve some kind of recovery conflicts
(like here -
https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com
)
3) Mask index FPI during a replay on hot standby + mark page as “primary
LP_DEAD free” in btpo_flags

Of course, each variant requires some special additional things to keep
everything safe.

As I see in SetHintsBits limitations are related to XLogNeedsFlush (check
of minRecoveryPoint in case of standby).

> Conclusion: The whole minRecoveryPoint question that you're trying to
> answer to improve things for your patch is just the wrong question.
> Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
> would be useful to set "true hint bits" on standbys earlier, and maybe
> thinking about minRecoveryPoint would help with that problem, but that
> doesn't help your index-related patch -- because indexes simply don't
> have true hint bits.

Attention to minRecoveryPoint is required because of possible situation
described here -
https://www.postgresql.org/message-id/flat/CANtu0ojwFcSQpyCxrGxJuLVTnOBSSrzKuF3cB_yCk0U-X-wpGw%40mail.gmail.com#4d8ef8754b18c5e35146ed589b25bf27
The source of the potential problem - is the fact what the setting of
LP_DEAD does not change page LSN and it could cause consistency issues
during crash recovery.

Thanks,
Michail.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2021-02-01 21:26:23 Re: [sqlsmith] Failed assertion during partition pruning
Previous Message Robert Haas 2021-02-01 21:15:26 Re: [HACKERS] Custom compression methods