[PATCH] Full support for index LP_DEAD hint bits on standby

From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] Full support for index LP_DEAD hint bits on standby
Date: 2021-01-18 20:30:21
Message-ID: CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7=9Y92+US_WHGOoQevg@mail.gmail.com

Hello, hackers.

[ABSTRACT]

Running queries on a hot standby is one of the most popular ways to
scale application workload. Many modern Postgres installations have
two standby nodes for high availability, so utilizing the replicas'
CPU is a reasonable idea.
At the same time, some queries (index scans) can be much slower on a
hot standby than on the primary. This happens because the LP_DEAD
index hint bit mechanism is ignored in index scans during recovery.
That is done for good reasons, of course [1]:

* We do this because the xmin on the primary node could easily be
* later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea
to support "ignore_killed_tuples" on standby.

The goal of this patch is to provide full support for index hint bits
on hot standby. The mechanism should be based on well-tested
functionality and not cause a lot of recovery conflicts.

This thread is a continuation of (and partly a copy-paste from) the
previous one [4].

[PROBLEM]

The standby itself can set and read hint bits during recovery, and
such bits are even correct according to the standby's visibility
rules. The problem is the full-page-write (FPW) WAL records coming
from the primary: such records can carry hint bits that are invalid
according to the standby's xmin.

So, if we can be sure a scan never sees an invalid hint bit from the
primary, the problem is solved. We can then even allow the standby to
set its own LP_DEAD bits.

The idea is simple: have the primary WAL-log the hint-bit horizon
before the FPW somehow. That could generate a lot of additional WAL,
however...

But there are ways to avoid it:
1) Send only one `latestRemovedXid` for all tuples marked dead during
a page scan.
2) Remember the latest `latestRemovedXid` sent, in shared memory, and
optimistically skip WAL records with older xid values [5].
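Optimization (2) can be pictured with a minimal, single-process sketch. This is not the patch's code: the names (`lastLoggedHorizon`, `horizon_record_needed`) are made up, there is no locking, and a real implementation would need PostgreSQL's wraparound-aware `TransactionIdFollows` instead of plain `<=`:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical shared-memory slot: the newest horizon xid already
 * WAL-logged for the current database. */
static TransactionId lastLoggedHorizon = 0;

/*
 * Decide whether an XLOG_INDEX_HINT_BITS_HORIZON record must actually be
 * emitted for latestRemovedXid. Records whose xid is not newer than the
 * last one sent are optimistically skipped.
 */
static bool
horizon_record_needed(TransactionId latestRemovedXid)
{
    if (latestRemovedXid <= lastLoggedHorizon)
        return false;           /* an equal or newer horizon was already logged */
    lastLoggedHorizon = latestRemovedXid;
    return true;
}
```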

Such WAL records would still cause a lot of recovery conflicts on
standbys. But we can be tricky here: use the hint bits only if
hot_standby_feedback is enabled and effective on the standby. If HSF
is effective, conflicts are not possible. If HSF is off, the standby
ignores both the hint bits and the additional conflict resolution.
The key point is that HSF is just an optimization and has nothing to
do with MVCC correctness.

[DETAILS]

The patch introduces a new WAL record (named
XLOG_INDEX_HINT_BITS_HORIZON) to define a horizon of xmin required for
standbys snapshot to use LP_DEAD bits for an index scan.

`table_index_fetch_tuple` now returns a `latest_removed_xid` value in
addition to `all_dead`. This value is used to advance
`killedLatestRemovedXid` when updating `killedItems` (see
`IndexHintBitAdvanceLatestRemovedXid`).
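A simplified sketch of how that horizon might be advanced while an index scan collects killed items. The struct and helper names here are illustrative, not the patch's actual code, and the xid comparison is deliberately naive (PostgreSQL must use `TransactionIdFollows` to handle wraparound):

```c
#include <stdint.h>

typedef uint32_t TransactionId;

#define MaxKilledItems 64

typedef struct
{
    int           killedItems[MaxKilledItems]; /* offsets of dead index tuples */
    int           numKilled;
    TransactionId killedLatestRemovedXid;      /* horizon for the whole page */
} KilledItemsState;

/*
 * Remember a dead index tuple and advance the page-level horizon, so that
 * a single latestRemovedXid can later be WAL-logged for all killed items.
 */
static void
remember_killed_item(KilledItemsState *st, int itemOffset,
                     TransactionId latest_removed_xid)
{
    if (st->numKilled < MaxKilledItems)
        st->killedItems[st->numKilled++] = itemOffset;

    /* simplified comparison; the real code must be wraparound-aware */
    if (latest_removed_xid > st->killedLatestRemovedXid)
        st->killedLatestRemovedXid = latest_removed_xid;
}
```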

The primary sends the value of `killedLatestRemovedXid` in an
XLOG_INDEX_HINT_BITS_HORIZON record before it marks the page dirty
after setting LP_DEAD bits on the index page (by calling
`MarkBufferDirtyIndexHint`).

The new WAL record is always sent before a possible FPW. Such a
record is only required if its `latestRemovedXid` is newer than the
one previously sent for the current database (see
`LogIndexHintBitsHorizonIfNeeded`).

There is a new flag in the PGPROC structure:
`indexIgnoreKilledTuples`. If the flag is set to true, standby
queries will use LP_DEAD bits in index scans. In that case, the
snapshot is required to satisfy the new horizon pushed by
XLOG_INDEX_HINT_BITS_HORIZON records.

It is safe to set `indexIgnoreKilledTuples` to any value from a
correctness perspective, but a `true` value can cause recovery
conflicts. It is a compromise: use LP_DEAD bits but be exposed to
XLOG_INDEX_HINT_BITS_HORIZON conflicts, or vice versa.

How do we make the right decision about this compromise? It is pretty
simple: if `hot_standby_feedback` is on and primary-confirmed
feedback has been received, then set `indexIgnoreKilledTuples` (see
`GetSnapshotIndexIgnoreKilledTuples`).

While feedback is working as expected, the query will never be
canceled by XLOG_INDEX_HINT_BITS_HORIZON.
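The decision logic amounts to a few boolean checks. The real check lives in `GetSnapshotIndexIgnoreKilledTuples`; the function and parameter names below are made up for illustration:

```c
#include <stdbool.h>

/*
 * Illustrative model of the compromise described above: use LP_DEAD bits
 * in index scans only when hot_standby_feedback is on and the primary has
 * confirmed that our xmin is honored, so horizon conflicts cannot occur.
 */
static bool
snapshot_index_ignore_killed_tuples(bool in_recovery,
                                    bool hot_standby_feedback,
                                    bool feedback_confirmed_by_primary)
{
    if (!in_recovery)
        return true;    /* on the primary, LP_DEAD bits are always usable */
    return hot_standby_feedback && feedback_confirmed_by_primary;
}
```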

To support cascading standby setups (where the feedback chain may be
broken in the middle), an additional byte was added to the keep-alive
message of the feedback protocol. This byte is used to make sure our
xmin is honored by the primary (see
`sender_propagates_feedback_to_primary`). Also, the WAL sender now
always sends a keep-alive after receiving a feedback message.
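The feedback-chain check can be pictured like this. This is a hypothetical model, not the patch's wire format: each keep-alive carries one extra flag saying whether the sender's upstream chain honors the standby's xmin, and feedback counts as effective only when both the local setting and that flag agree:

```c
#include <stdbool.h>

/* Hypothetical keep-alive payload: one extra byte appended to the
 * feedback protocol's keep-alive message. */
typedef struct
{
    bool xmin_honored_by_primary;   /* the new flag byte */
} KeepaliveMsg;

/*
 * On a (possibly cascading) standby: our xmin reaches the primary only if
 * we send feedback ourselves and the node we stream from reports that the
 * whole upstream chain propagates it (cf. sender_propagates_feedback_to_primary).
 */
static bool
feedback_effective(bool local_hot_standby_feedback, KeepaliveMsg upstream)
{
    return local_hot_standby_feedback && upstream.xmin_honored_by_primary;
}
```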

So, this way, it is safe to use the LP_DEAD bits received from the
primary when we want to.

And, as a result, it is also safe to set LP_DEAD bits on the standby,
even if:
* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery
– nothing can go wrong here.

That is because `HeapTupleIsSurelyDead` (and therefore index LP_DEAD)
requires the *heap* hint bits to already be set on the standby. So
the same code that decides to set hint bits on the heap (which
standbys have been doing for a long time) also governs the index.

[EVALUATION]
It is hard to devise an ideal performance test for this kind of
optimization.

But a possible example is in the attachment. It is a standard pgbench
schema with an additional index on balance, and random balance
values.

On the primary, the test does the following:
1) transfer some money from one random account among the top 100
richest to one random account among the top 100 poorest;
2) calculate the amount of money in the top 10 richest and top 10
poorest accounts (including an additional field to avoid an
index-only scan).
On the standby, only step 2 is used.

The patched version is about 9x faster for standby queries: roughly
4192 TPS versus 455 TPS unpatched on my system. There is no visible
difference on the primary.

To estimate the additional amount of WAL, I checked the records in
WAL segments under different conditions:
(pg_waldump pgdata/pg_wal/XXX | grep INDEX_HINT_BITS_HORIZON | wc -l)

- hot_standby_feedback=off - 5181 of 226274 records ~2%
- hot_standby_feedback=on (without load on standby) - 70 of 202594
records ~ 0.03%
- hot_standby_feedback=on (with load on standby) - 17 of 70504 records ~ 0.02%

So, with HSF=on (the default) the WAL increase is not significant.
Also, for HSF=off it should be possible to radically reduce the
number of additional WAL records by reusing `latestRemovedXid` from
other records (like Heap2/CLEAN) in the "send only newer xids"
optimization (I have skipped this for now for simplicity).

[CONCLUSION]

The only price we pay is a few additional WAL records and some
moderate additional code complexity. But support for hint bits on
standby is a huge advantage for many workloads. I was able to get
more than a 900% performance boost (not surprising, since index hint
bits are a great optimization). And it works for almost all index
types out of the box.

Another major point: everything is based on old, well-tested
mechanisms: query cancellation due to snapshot conflicts, setting
heap hint bits on standby, and hot standby feedback.

[REFERENCES]

[1] - https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org
[4] - https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/flat/CANtu0oigC0%2BH0UkxktyovdLLU67ikM0%2BDw3J4EQqiDDeGhcwsQ%40mail.gmail.com

Attachment Content-Type Size
code.patch text/x-patch 50.1 KB
docs.patch text/x-patch 5.5 KB
test.patch text/x-patch 10.6 KB
pefttest.tar.gz application/gzip 697 bytes
