Re: Thoughts on "killed tuples" index hint bits support on standby

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Thoughts on "killed tuples" index hint bits support on standby
Date: 2020-01-16 17:54:32
Message-ID: 20200116175432.jsd6pbn7cqaus2rp@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-01-16 14:30:12 +0300, Michail Nikolaev wrote:
> First thing we need to consider - checksums and wal_log_hints are
> widely used these days. So, at any moment the master could send an FPW
> page with new "killed tuples" hints and overwrite hints set by the
> standby. Moreover, it is not possible to distinguish whether hints were
> set by the primary or by the standby.

Note that the FPIs are only going to be sent for the first write to a
page after a checkpoint. I don't think you're suggesting we rely on them
for correctness (this far into the email at least), but it still seems
worthwhile to point out.

> And there is where hot_standby_feedback comes into play. The master
> node considers the xmin of hot_standby_feedback replicas
> (RecentGlobalXmin) while setting "killed tuples" bits. So, if
> hot_standby_feedback has been enabled on the standby for a while - it
> could safely trust hint bits from the master.

Well, not easily. There's no guarantee that the node it reports
hot_standby_feedback to is actually the primary. It could be a
cascading replica setup that doesn't report hot_standby_feedback
upstream. Also, hot_standby_feedback only takes effect while a streaming
connection is active; if that is even temporarily interrupted, the
primary will lose all knowledge of the standby's horizon - unless
replication slots are in use, that is.

Additionally, we also need to handle the case where the replica has
restarted and is recovering using local WAL and/or archive based
recovery. In that case the standby could already have sent a *newer*
horizon as feedback upstream, but currently again has an older view. It
is entirely possible that the standby is consistent and queryable in
such a state (if nothing forced disk writes during WAL replay,
minRecoveryPoint will not be set to something too new).

> Also, the standby could set its own hints using the xmin it sends to
> the primary during feedback (but without marking the page as dirty).

We do something similar for heap hint bits already...

> Of course all is not so easy, there are a few things and corner cases
> to take care of:
> * It looks like RecentGlobalXmin could be moved backwards in case a
> new replica with a lower xmin is connected (or by switching some
> replica to hot_standby_feedback=on). We must ensure RecentGlobalXmin
> moves strictly forward.

I'm not sure this is a problem. If that happens we cannot rely on the
different xmin horizon anyway, because action may already have been
taken based on the old RecentGlobalXmin. Thus we need to be safe
against that regardless.
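
For illustration only, the "moves strictly forward" guard from the
quoted list could look roughly like the sketch below; the variable name
and the absence of locking are invented, this is not existing code:

    static TransactionId standby_known_horizon = InvalidTransactionId;

    /*
     * Advance the standby's copy of the primary's horizon, but never move
     * it backwards, so that a newly connected replica with an older xmin
     * cannot pull it back.
     */
    static void
    UpdateStandbyKnownHorizon(TransactionId new_horizon)
    {
        if (!TransactionIdIsValid(new_horizon))
            return;

        if (!TransactionIdIsValid(standby_known_horizon) ||
            TransactionIdFollows(new_horizon, standby_known_horizon))
            standby_known_horizon = new_horizon;
    }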

> * hot_standby_feedback could be enabled on the fly. In such a case we
> need to distinguish transactions which are safe or unsafe to deal with
> hints. The standby could receive a fresh RecentGlobalXmin as response
> to a feedback message. All standby transactions with xmin >=
> RecentGlobalXmin are safe to use hints.
> * hot_standby_feedback could be disabled on the fly. In such a
> situation the standby needs to continue to send feedback while
> canceling all queries with ignore_killed_tuples=true. Once all such
> queries are canceled, feedback is no longer needed and should be
> disabled.

I don't think we can rely on hot_standby_feedback at all. We can use it
to avoid unnecessary cancellations, etc., and even assume it's set up
reasonably for some configurations, but there always needs to be an
independent correctness backstop.

I think it might be more promising to improve the kill logic based on
the WAL logged horizons from the primary. All we need to do, I think,
is to use a more conservatively computed RecentGlobalXmin when
determining whether tuples are dead. We already regularly log an
xl_running_xacts record; adding information about the primary's horizon
to that, and stashing it in shared memory on the standby, shouldn't be
too hard. Then we can make correct, albeit likely overly pessimistic,
visibility determinations about tuples, and go on to set LP_DEAD.
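
To make that a bit more concrete, here's a rough sketch of the extra
piece of information and the standby-side stash; the struct field and
the procArray member are hypothetical, not existing structures:

    typedef struct xl_running_xacts_ext
    {
        /* ... existing xl_running_xacts fields ... */
        TransactionId oldestHorizonXmin; /* primary's xmin horizon at log time */
    } xl_running_xacts_ext;

    /* during redo of the record on the standby */
    static void
    StashPrimaryHorizon(TransactionId oldestHorizonXmin)
    {
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        /* hypothetical shared-memory field; only ever move it forward */
        if (!TransactionIdIsValid(procArray->primaryHorizonXmin) ||
            TransactionIdFollows(oldestHorizonXmin,
                                 procArray->primaryHorizonXmin))
            procArray->primaryHorizonXmin = oldestHorizonXmin;
        LWLockRelease(ProcArrayLock);
    }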

There are some complexities around how to avoid unnecessary query
cancellations. We'd not want to trigger recovery conflicts based on the
new xl_running_xacts field, as that'd make the conflict rate go through
the roof - but I think we could safely use the logical minimum of the
local RecentGlobalXmin and the primary's.
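
By "logical minimum" I mean whichever of the two horizons is older,
roughly like the sketch below; GetPrimaryHorizon() just stands in for
however the value stashed from xl_running_xacts would be read back:

    static TransactionId
    GetConservativeKillHorizon(void)
    {
        TransactionId local_horizon = RecentGlobalXmin;
        TransactionId primary_horizon = GetPrimaryHorizon(); /* hypothetical */

        /* without a value from the primary, fall back to the local horizon */
        if (!TransactionIdIsValid(primary_horizon))
            return local_horizon;

        /* use whichever horizon is older, i.e. the more conservative one */
        if (TransactionIdPrecedes(primary_horizon, local_horizon))
            return primary_horizon;

        return local_horizon;
    }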

That should allow us to set additional LP_DEAD bits safely, I believe. We
could even rely on those LP_DEAD bits. But:

I'm less clear on how we can make sure that we can *rely* on LP_DEAD to
skip over entries during scans, however. The bits set as described above
would be safe, but we can also see LP_DEAD bits set by the primary (and
even by upstream cascading standbys, at least in the case of new base
backups taken from them), due to them not being WAL logged. As we don't
WAL log them, there is no conflict associated with the LP_DEAD bits
being set. My gut feeling is that it's going to be very hard to get
around this without adding WAL logging for _bt_killitems et al
(including an interface for kill_prior_tuple to report the used horizon
to the index).
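
For the sake of argument, such a WAL record might have roughly the shape
sketched below; none of this exists, it's only meant to show the
interface, with kill_prior_tuple reporting the horizon it relied on:

    typedef struct xl_btree_kill_items
    {
        TransactionId latestRemovedXid; /* horizon reported via kill_prior_tuple */
        uint16        nitems;           /* number of item offsets marked LP_DEAD */
        /* MARKED OFFSET NUMBERS FOLLOW, in the main data of the record */
    } xl_btree_kill_items;

    #define SizeOfBtreeKillItems \
        (offsetof(xl_btree_kill_items, nitems) + sizeof(uint16))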

I'm wondering if we could recycle BTPageOpaqueData.xact to store the
horizon applying to killed tuples on the page. We don't need to store
the level for leaf pages, because we have BTP_LEAF, so we could make
space for that (potentially signalled by a new BTP flag). Obviously we
have to be careful with storing xids in the index, due to the potential
wraparound danger - but I think such a page would have to be vacuumed
anyway, before a potential wraparound. I think we could safely unset
the xid during nbtree single-page cleanup, and during vacuum, by making
sure no LP_DEAD entries survive, and by including the horizon in the
generated WAL record.
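
On a leaf page that could look roughly like the sketch below;
BTP_HAS_KILL_HORIZON is an invented flag bit, everything else uses the
existing nbtree page-opaque layout:

    #define BTP_HAS_KILL_HORIZON (1 << 8) /* hypothetical: btpo.xact holds LP_DEAD horizon */

    static void
    BTSetKillHorizon(Page page, TransactionId horizon)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        Assert(P_ISLEAF(opaque));   /* level is implied by BTP_LEAF */

        /* only ever move the stored horizon forward */
        if ((opaque->btpo_flags & BTP_HAS_KILL_HORIZON) == 0 ||
            TransactionIdFollows(horizon, opaque->btpo.xact))
        {
            opaque->btpo.xact = horizon;
            opaque->btpo_flags |= BTP_HAS_KILL_HORIZON;
        }
    }

Single-page cleanup and vacuum would clear the flag again once no
LP_DEAD entries remain, which keeps the stored xid from surviving long
enough to be a wraparound hazard.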

That still doesn't really allow us to set LP_DEAD on standbys, however,
but it'd allow us to take the primary's LP_DEADs into account on a
standby. I think we'd run into torn page issues if we were to do so
without WAL logging, because we'd rely on the LP_DEAD bits and
BTPageOpaqueData.xact being in sync. I *think* we might be safe to do
so *iff* the page's LSN indicates that there has been a WAL record
covering it since the last redo location.
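
The check I have in mind is something like the sketch below, comparing
the page LSN against the last checkpoint/restartpoint redo pointer;
whether GetRedoRecPtr() is really the right thing to consult on a
standby is exactly the part I'm not sure about:

    static bool
    BTPageCoveredSinceRedo(Page page)
    {
        XLogRecPtr  page_lsn = PageGetLSN(page);
        XLogRecPtr  redo_ptr = GetRedoRecPtr(); /* last redo location */

        /*
         * If the page has been touched by a WAL record since the redo
         * location, any torn write would be fixed up by the FPI during
         * crash recovery, so the LP_DEAD bits and btpo.xact can be assumed
         * to be consistent with each other.
         */
        return page_lsn >= redo_ptr;
    }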

I only had my first cup of coffee for the day, so I might also just be
entirely off base here.

Greetings,

Andres Freund
