Re: Thoughts on "killed tuples" index hint bits support on standby

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Thoughts on "killed tuples" index hint bits support on standby
Date: 2020-04-05 20:05:00
Message-ID: CAH2-WznaEcg90oH43J5p3JVHmLqP3XEfHOGCT-YLV53pS8X+Gg@mail.gmail.com
Lists: pgsql-hackers

Hi Michail,

On Fri, Jan 24, 2020 at 6:16 AM Michail Nikolaev
<michail(dot)nikolaev(at)gmail(dot)com> wrote:
> Some of the issues you mentioned (reporting feedback to another
> cascade standby, processing queries after restart, and a newer xid
> already reported) could be fixed in the provided design, but your
> intention to have an "independent correctness backstop" is the right
> thing to do.

Attached is a very rough POC patch of my own, which makes item
deletion occur "non-opportunistically" in unique indexes. The idea is
that we exploit the uniqueness property of unique indexes to identify
"version churn" from non-HOT updates. If any single value on a leaf
page has several duplicates, then there is a good chance that we can
safely delete some of them. It's worth selectively going to the heap to
check whether that's actually safe, at the point where we'd usually
have to split the page. We only need to free one or two items to avoid
splitting the page. If we can avoid splitting the page immediately, we
may well avoid splitting it indefinitely, or forever.
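
To make that concrete, here is a tiny self-contained C model of the
idea. This is not code from the attached patch; LeafPage,
delete_dead_duplicates, and the heap callback are all made-up names for
illustration. At the point where the leaf page would otherwise split,
we look for values with more than one item, visit the heap only for
those, and physically delete the items whose heap tuples turn out to be
dead:

#include <stdbool.h>
#include <stdio.h>

typedef struct LeafItem
{
    int     key;        /* indexed value; duplicates in a unique index imply version churn */
    int     heap_tid;   /* stand-in for a heap item pointer */
} LeafItem;

typedef struct LeafPage
{
    LeafItem    items[8];   /* tiny fixed-size "page" for the sketch */
    int         nitems;
} LeafPage;

/* Stand-in for a heap visit: is this heap tuple dead to every snapshot? */
typedef bool (*heap_tuple_is_dead_fn) (int heap_tid);

/*
 * Try to free space by deleting dead duplicates of any value that appears
 * more than once on the page.  Returns the number of items deleted.  Only
 * called at the point where the page would otherwise have to split.
 */
static int
delete_dead_duplicates(LeafPage *page, heap_tuple_is_dead_fn tuple_is_dead)
{
    int     ndeleted = 0;

    for (int i = 0; i < page->nitems; i++)
    {
        bool    has_duplicate = false;

        /* Unique index: a second item with the same key must be an old version */
        for (int j = 0; j < page->nitems; j++)
        {
            if (j != i && page->items[j].key == page->items[i].key)
            {
                has_duplicate = true;
                break;
            }
        }

        /* Only pay for a heap visit when duplicates make it likely to pay off */
        if (has_duplicate && tuple_is_dead(page->items[i].heap_tid))
        {
            /* Physically delete item i by shifting the rest of the page down */
            for (int j = i; j < page->nitems - 1; j++)
                page->items[j] = page->items[j + 1];
            page->nitems--;
            ndeleted++;
            i--;                /* re-examine the slot we just filled */
        }
    }

    return ndeleted;
}

/* Called where the leaf page would otherwise split: did we avoid the split? */
static bool
maybe_avoid_split(LeafPage *page, heap_tuple_is_dead_fn tuple_is_dead)
{
    /* Freeing even one or two items is enough to fit the incoming tuple */
    return delete_dead_duplicates(page, tuple_is_dead) > 0;
}

/* Toy "heap" for the example: pretend even-numbered TIDs are dead versions */
static bool
fake_heap_check(int heap_tid)
{
    return heap_tid % 2 == 0;
}

int
main(void)
{
    LeafPage    page = {
        .items = {{1, 10}, {1, 11}, {2, 12}, {2, 13}, {3, 15}, {4, 17}, {5, 19}, {6, 21}},
        .nitems = 8
    };

    if (maybe_avoid_split(&page, fake_heap_check))
        printf("freed space; split avoided, %d items remain\n", page.nitems);
    else
        printf("nothing could be deleted; the page must split\n");

    return 0;
}

In the real code the heap check and the deletion obviously have to
respect buffer locking, visibility rules, and WAL logging, but the
shape of the decision is what matters here: only duplicates trigger a
heap visit, and freeing even a couple of items is enough to avoid the
split.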

This approach seems to be super effective. It can leave the PK on
pgbench_accounts in pristine condition (no page splits) after many
hours with a pgbench-like workload that makes all updates non-HOT
updates. Even with many clients, and a skewed distribution. Provided
the index isn't tiny to begin with, we can always keep up with
controlling index bloat -- once the client backends themselves begin
to take active responsibility for garbage collection, rather than just
treating it as a nice-to-have. I'm pretty sure that I'm going to be
spending a lot of time developing this approach, because it really
works.
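
For anyone who wants to try a workload like this, one simple way to
make every pgbench_accounts update non-HOT (not necessarily the exact
setup I used) is to index the column that the built-in tpcb-like script
modifies, since changing any indexed column disqualifies an update from
HOT:

-- pgbench's built-in tpcb-like script runs, among other statements:
--   UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-- Indexing abalance means every such update modifies an indexed column,
-- so none of them can be HOT updates.  The index name is arbitrary.
CREATE INDEX pgbench_accounts_abalance_idx
    ON pgbench_accounts (abalance);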

This seems fairly relevant to what you're doing. It makes almost all
index cleanup occur using the new delete infrastructure for some of
the most interesting workloads where deletion takes place in client
backends. In practice, a standby will be in almost the same position
as the primary in a workload that this approach really helps with,
since setting the LP_DEAD bit itself doesn't really need to happen (we
can go straight to deleting the items in the new deletion path).

To address the questions you've asked: I don't really like the idea of
introducing new rules around tuple visibility and WAL logging to set
more LP_DEAD bits like this at all. It seems very complicated. I
suspect that we'd be better off introducing ways of making the actual
deletes occur sooner on the primary, possibly much sooner, avoiding
any need for special infrastructure on the standby. This is probably
not limited to the special unique index case that my patch focuses on
-- we can probably push this general approach forward in a number of
different ways. I just started with unique indexes because that seemed
most promising. I have only worked on the project for a few days. I
don't really know how it will evolve.

--
Peter Geoghegan

Attachment Content-Type Size
0001-Non-opportunistically-delete-B-Tree-items.patch application/octet-stream 11.9 KB
