Re: [HACKERS] A design for amcheck heapam verification

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] A design for amcheck heapam verification
Date: 2018-03-27 21:18:18
Message-ID: CAH2-WznvLi3qbgPUeVWkLQ7wFTDwWq2YonB6dv2PeY63Ywh8Aw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> + * When index-to-heap verification is requested, a Bloom filter is used to
> + * fingerprint all tuples in the target index, as the index is traversed to
> + * verify its structure. A heap scan later verifies the presence in the heap
> + * of all index tuples fingerprinted within the Bloom filter.
> + *
>
> Is that correct? Aren't we actually verifying the presence in the index of
> all heap tuples?

I think that you could describe it either way. We're verifying that
every heap tuple that ought to have been indexed has a matching entry
among the index tuples that were fingerprinted (which covers most,
though not necessarily all, of the fingerprinted tuples).

As I say above the callback routine bt_tuple_present_callback(), we
blame any corruption on the heap, since that's more likely. It could
actually be the index which is corrupt, though, in rare cases where a
corrupt index still manages to pass the index structure tests. This
general orientation comes through in the comment that you asked about.
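
To make the two passes concrete, here is roughly their shape. This is a
reconstruction from the discussion, not the patch text; the state struct
and the bloom_add_element()/bloom_lacks_element() helper names are my
assumptions, and the error wording is illustrative. During the index
walk, each index tuple is fingerprinted:

    bloom_add_element(state->filter, (unsigned char *) itup,
                      IndexTupleSize(itup));

The heap scan callback then builds the index tuple that each heap tuple
should have produced, and probes the filter:

static void
bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
                          bool *isnull, bool tupleIsAlive, void *checkstate)
{
    BtreeCheckState *state = (BtreeCheckState *) checkstate;
    IndexTuple  itup;

    /* Generate a normalized index tuple for fingerprint probing */
    itup = index_form_tuple(RelationGetDescr(index), values, isnull);
    itup->t_tid = htup->t_self;

    /* A miss is reported as heap corruption, per the above */
    if (bloom_lacks_element(state->filter, (unsigned char *) itup,
                            IndexTupleSize(itup)))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("heap tuple lacks matching index tuple in index \"%s\"",
                        RelationGetRelationName(state->rel))));

    pfree(itup);
}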

> Can we come up with a better name for heapallindexed? Maybe
> "check_heapindexed"?

I don't think that that name is any better, and you're the first to
mention it in almost a year. I'd rather leave it. The fact that it's a
check is pretty clear from context.

> What I am not sure about is whether we can really examine an index which is
> valid but whose indcheckxmin hasn't crossed our xmin horizon? Notice that
> this amcheck might be running in a transaction block, probably in a
> repeatable-read isolation level, and hence GetTransactionSnapshot() might
> actually return a snapshot which can't yet read the index consistently. In
> practice, this is quite unlikely, but I think we should check for that case
> if we agree that it could be a problem.

You're right -- there is a narrow window for REPEATABLE READ and
SERIALIZABLE transactions. This is a regression in v6, the version
that removed the TransactionXmin test.

I am tempted to fix this by calling GetLatestSnapshot() instead of
GetTransactionSnapshot(). However, that has a problem of its own -- it
won't work in parallel mode, and we're dealing with parallel
restricted functions, not parallel unsafe functions. I don't want to
change that to fix such a narrow issue. IMV, a better fix is to treat
this as a serialization failure. The attached patch, which applies on
top of v7, shows what I mean.
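
Concretely, the defense looks something like this (reconstructed to
match the description above -- the attached patch is authoritative, and
its details may differ):

    /*
     * GetTransactionSnapshot() can return an old snapshot at REPEATABLE
     * READ or higher.  If indcheckxmin indicates that the index isn't
     * usable from that snapshot yet, raise a serialization failure.
     */
    if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
        !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
                               snapshot->xmin))
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("index \"%s\" cannot be verified using transaction snapshot",
                        RelationGetRelationName(rel))));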

I think that this bug is practically impossible to hit, because we use
the xmin from the pg_index tuple during the "is the index safe to
use?" indcheckxmin/TransactionXmin checks (the patch that I attached
adds a similar though distinct check). That raises another question
for a REPEATABLE READ xact: how is a REPEATABLE READ transaction
supposed to see the pg_index entry to get the index relation's OID and
call a verification function in the first place? My point is that
there is no need for a more complicated solution than what I propose.

Note that the attached patch doesn't update the existing comments on
TransactionXmin, since I see this as a serialization error, which is a
distinct thing -- I'm not actually doing anything with
TransactionXmin.

> The case with readonly mode is also interesting. Since we're scanning heap
> with SnapshotAny, heapscan may return us tuples which are RECENTLY_DEAD. So
> the question is: can this happen?
>
> - some concurrent index scan sees a heaptuple as DEAD and marks the index
> entry as LP_DEAD
> - our index fingerprinting sees index tuple as LP_DEAD
> - our heap scan still sees the heaptuple as RECENTLY_DEAD
>
> Now that looks improbable given that we compute OldestXmin after our index
> fingerprinting was done, i.e. between steps 2 and 3, and hence if a tuple
> looked DEAD to some OldestXmin/RecentGlobalXmin computed before we computed
> our OldestXmin, then surely our OldestXmin should also see the tuple DEAD.
> Or is there a corner case that we're missing?

I don't think so. The way we compute OldestXmin for
IndexBuildHeapRangeScan() is rather like a snapshot acquisition --
GetOldestXmin() locks the proc array in shared mode. As you pointed
out, the fact that it comes after everything else (the fingerprinting)
means that it must be equal to or later than the horizon that index
scans used when the kill_prior_tuple optimization set those LP_DEAD
bits.

The one exception is Hot Standby mode, where it's possible for the
result of GetOldestXmin() (OldestXmin) to go backwards across
successive calls. However, that's not a problem for us because
readonly heapallindexed verification does not work in Hot Standby
mode.

> Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
> tuples, especially if those tuples were inserted/deleted by our own
> transaction? It's probably worth thinking about.
>
> Apart from that, comments in IndexBuildHeapRangeScan() claim that the
> function is called only with ShareLock and above, which is no longer true.
> We should check if that has any side-effects. I can't think of any, but
> better to verify and update the comments to reflect the new reality.

Those comments only seem to apply in the SnapshotAny/ShareLock case,
which is what amcheck calls the readonly case. When amcheck does not
have a ShareLock, it has an AccessShareLock on the heap relation
instead, and we imitate CREATE INDEX CONCURRENTLY.
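
Concretely, imitating CIC here mostly amounts to how the IndexInfo is
set up before handing the heap scan off to IndexBuildHeapRangeScan().
A sketch of the approach (the exact field setup is my reconstruction,
not the patch text):

    indexinfo = BuildIndexInfo(state->rel);

    /*
     * As in CREATE INDEX CONCURRENTLY, a "concurrent" build scans the
     * heap with an MVCC snapshot rather than SnapshotAny.
     */
    indexinfo->ii_Concurrent = !state->readonly;

    /* Verification never enforces uniqueness or exclusion constraints */
    indexinfo->ii_Unique = false;
    indexinfo->ii_ExclusionOps = NULL;

    /*
     * ... then let IndexBuildHeapRangeScan() drive the heap scan,
     * invoking bt_tuple_present_callback() for each tuple that ought
     * to be indexed.
     */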

IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
ShareUpdateExclusiveLock (it just says ShareLock), because that lock
strength doesn't have anything to do with IndexBuildHeapRangeScan()
when it operates with an MVCC snapshot. I think that this means that
this patch doesn't need to update comments within
IndexBuildHeapRangeScan(). Perhaps updating them would be a good idea
anyway, but it seems independent.

> One thing worth documenting/noting is the fact that a !readonly check will
> run with a long-duration registered snapshot, thus holding OldestXmin back.

The simple fact that you have a long-running statement already implies
that that'll happen, since the statement must have a snapshot that is
at least as old as the first snapshot acquired by the first call to
bt_index_check(). It's not a special case; it's exactly as bad as any
statement that takes the same amount of time to execute.

> Is there anything we can do to lessen that burden, like telling other
> backends to ignore our xmin while computing OldestXmin (like vacuum does)?

I don't think so. The transaction involved is still an ordinary user
transaction.

--
Peter Geoghegan

Attachment Content-Type Size
0003-Defend-against-heapallindexed-using-transaction-snap.patch text/x-patch 2.0 KB
