Computing the conflict xid for index page-level-vacuum on primary

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Computing the conflict xid for index page-level-vacuum on primary
Date: 2018-12-14 01:42:35
Message-ID: 20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

Currently nbtree and hash indexes (and soon gist, where it's missing due
to oversight) compute an xid that is used to resolve recovery conflicts.
They do so by visiting all the heap pages
xl_btree_delete/xl_hash_vacuum_one_page point to item-by-item.

That's problematic for two projects I'm working on:

1) For pluggable storage it's problematic, because obviously we can't
have the nbtree/hash/gist code just assume that the index points to a
heap. There's no way we can do catalog lookups to determine the AM,
and additionally AMs are database specific, so the startup process
couldn't handle that, even if we somehow managed to get the oid or
such of the AM handler.

The zheap patchset currently solves that by having a separate flag
indicating that the index is over zheap, but that obviously doesn't
work with table storage intended to be pluggable.

2) For logical decoding on the standby [3], at least in my approach, we
need to be able to recognize snapshot conflicts not just once the
database has become consistent, but also before that point.
Unfortunately, the current approach requires that a consistent state
has already been reached:

    /*
     * In what follows, we have to examine the previous state of the index
     * page, as well as the heap page(s) it points to. This is only valid if
     * WAL replay has reached a consistent database state; which means that
     * the preceding check is not just an optimization, but is *necessary*. We
     * won't have let in any user sessions before we reach consistency.
     */
    if (!reachedConsistency)
        elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data");

I think there's also the tertiary issue that doing substantial work
during recovery that's not performed on the primary is a bad idea,
because it makes it much more likely for the standby to start lagging
behind. Recovery is single-threaded, whereas records like
xl_btree_delete can be emitted concurrently by every single backend. In
the current scheme there's no backpressure whatsoever. This also means
that we redo that work on every replica.

This leads me to think that we possibly should move computation of the
last removed xid from recovery to the primary, during the generation of
the xl_btree_delete WAL record.

That obviously has the downside that we're doing the additional work
whenever wal_level >= replica, where previously it was only done during
recovery. I don't really see a good way around that for pluggable
storage, unfortunately - I'd be delighted to hear of one. Both
during recovery and on the primary that work happens while the index
page is locked.

In the prototype I've attached, I've attempted to address that by making
the routine less inefficient than before:
- Previously it fetched a buffer for every single tuple, in the order of
items on the index page. It's pretty common for there to be multiple
tuples on the same page, and it's more efficient to access pages in
the on-disk order.
- We didn't do any prefetching, which seems a waste given how easily it
can be done in this case. For now I've just added unconditional
prefetching, but I guess that ought instead to be something respecting
effective_io_concurrency :/
- It seems to me that it's more likely on the primary that the relevant
heap pages are already cached: The kill_prior_tuple logic needs to
have kicked in to start the page level deletion after all, and that
doesn't WAL log any changes to the heap page.

On my laptop I wasn't able to observe a performance regression due
to these changes, while recovery performance did increase; but I
suspect it's possible to see performance regressions on disks that are
less fast than my SSD, in particular on rotational disks.

Talking with Peter Geoghegan we were pondering a few ideas to address
that:
- we could stop doing page level deletion after the tuples on the first
heap page, if that frees sufficient space on the index page, to stave
off a page split for longer. That has the downside that it's possible
that we'd end up WAL logging more, because we'd repeatedly emit
xl_btree_delete records.
- We could use the visits to the heap pages to be *more* aggressive
about deleting index entries. If there are some tuples on a heap page
that are deletable, and we know about that due to the kill_prior_tuple
logic, it's fairly likely that other pointers to the same page
actually are also dead: we could re-check that for all index entries
pointing to pages that we'd visit anyway.
- We could invent an 'exclusive read' lwlock level, and downgrade the
index buffer lock while we check the heap pages, and then upgrade
afterwards. A cursory look makes that look safe, but it'd obviously
be a nontrivial change.

Comments? Better ideas? Pitchforks?

Minor aside:

The code currently has these two comments:
    /*
     * If there's nothing running on the standby we don't need to derive a
     * full latestRemovedXid value, so use a fast path out of here. This
     * returns InvalidTransactionId, and so will conflict with all HS
     * transactions; but since we just worked out that that's zero people,
     * it's OK.
     *
     * XXX There is a race condition here, which is that a new backend might
     * start just after we look. If so, it cannot need to conflict, but this
     * coding will result in throwing a conflict anyway.
     */
    if (CountDBBackends(InvalidOid) == 0)
        return latestRemovedXid;
and
    /*
     * If all heap tuples were LP_DEAD then we will be returning
     * InvalidTransactionId here, which avoids conflicts. This matches
     * existing logic which assumes that LP_DEAD tuples must already be older
     * than the latestRemovedXid on the cleanup record that set them as
     * LP_DEAD, hence must already have generated a conflict.
     */
    return latestRemovedXid;

which contradict each other.

ResolveRecoveryConflictWithSnapshot indeed ignores such conflicts:

    /*
     * If we get passed InvalidTransactionId then we are a little surprised,
     * but it is theoretically possible in normal running. It also happens
     * when replaying already applied WAL records after a standby crash or
     * restart, or when replaying an XLOG_HEAP2_VISIBLE record that marks as
     * frozen a page which was already all-visible. If latestRemovedXid is
     * invalid then there is no conflict. That rule applies across all record
     * types that suffer from this conflict.
     */
    if (!TransactionIdIsValid(latestRemovedXid))
        return;

which means we don't handle the race condition above correctly, as far
as I can tell? This seems to have been introduced in [2]. It's
probably not too easy to hit this, because newer transactions on a
standby are likely to have a new enough snapshot, but I think it's
a question of likelihood, not certainty.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20181212224524.scafnlyjindmrbe6%40alap3.anarazel.de
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=52010027efc8757fdd830a4d0113763a501259bc
[3] https://www.postgresql.org/message-id/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de

Attachment Content-Type Size
index-page-vacuum-xid-horizon-primary.diff text/x-diff 21.5 KB
