Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date: 2021-11-18 07:19:41
Message-ID: 20211118071941.e4far2w5vufnahun@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

I spent quite a bit more time trying to come up with a good testcase for
reproducing this reliably. I didn't feel confident relying on testing that
takes substantial amounts of time to hit the bug.

On 2021-11-10 11:20:10 -0800, Andres Freund wrote:
> The way this definitely breaks - I have been able to reproduce this in
> isolation - is when one tuple is processed twice by heap_prune_chain(), and
> the result of HeapTupleSatisfiesVacuum() changes from
> HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
>
> Consider a page like this:
>
> lp 1: redirect to lp2
> lp 2: deleted by xid x, not yet committed
>
> and a sequence of events like this:
>
> 1) heap_prune_chain(rootlp = 1)
> 2) commit x
> 3) heap_prune_chain(rootlp = 2)
>
> 1) heap_prune_chain(rootlp = 1) will go to lp2, and see a
> HEAPTUPLE_DELETE_IN_PROGRESS and thus not do anything.
>
> 3) then could, with the snapshot scalability changes, get DEAD back from
> HTSV. Due to the "fuzzy" nature of the post-snapshot-scalability xid horizons,
> that is possible, because we can end up rechecking the boundary condition and
> seeing that now the horizon allows us to prune x / lp2.
>
> At that point we have a redirect tuple pointing into an unused slot. Which is
> "illegal", because something independent can be inserted into that slot.

The above isn't quite right - it turns out additional steps are needed for
this to be triggerable (without debugging code to make things easier to hit).
In the sequence shown above the result won't actually change, because x will
have held the xmin horizon back far enough that GlobalVisTestShouldUpdate()
won't ever recompute the horizon for x on its own.

However, what *can* happen is that *another* tuple falls within the "fuzzy"
horizon window, and that tuple can trigger a recomputation of the horizon -
which then basically triggers the problem described above.

However, even that is very hard to hit, because normally we won't recompute
the horizon in that case either, due to this heuristic:

* The current heuristic is that we update only if RecentXmin has changed
* since the last update. If the oldest currently running transaction has not
* finished, it is unlikely that recomputing the horizon would be useful.

I.e. a snapshot needs to be computed between the vacuum_set_xid_limits() and
heap_page_prune() calls. Turns out, there's a window: vac_open_indexes()
needs to lock the indexes, which in turn will trigger
AcceptInvalidationMessages(). Relcache invalidations can then trigger a
catalog snapshot to be built, which can increase RecentXmin and later allow a
new, accurate horizon to be computed.
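
To make that easier to follow, here's a heavily simplified standalone model
of the "in between horizons" test and of the heuristic quoted above - all
types and names below are made up for illustration, this is not the actual
procarray.c code (the real logic is in GlobalVisTestIsRemovableXid() et al):

    /* Toy model of the "fuzzy" horizon test, for illustration only. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    typedef struct
    {
        Xid maybe_needed;       /* xids older than this are definitely removable */
        Xid definitely_needed;  /* xids >= this are definitely not removable */
    } VisState;

    /* stand-ins for RecentXmin / ComputeXidHorizonsResultLastXmin */
    static Xid recent_xmin;
    static Xid last_xmin_at_horizon_calc;

    static void
    recompute_horizons(VisState *state)
    {
        /*
         * In the real code this recomputes state->maybe_needed and
         * state->definitely_needed accurately; elided here.
         */
        (void) state;
        last_xmin_at_horizon_calc = recent_xmin;
    }

    bool
    xid_is_removable(VisState *state, Xid xid)
    {
        if (xid < state->maybe_needed)
            return true;            /* definitely below the horizon */
        if (xid >= state->definitely_needed)
            return false;           /* definitely still needed by someone */

        /*
         * In between the two bounds: only recompute an accurate horizon if a
         * snapshot has been taken since the last computation, i.e. RecentXmin
         * changed.  That's why the bug needs a snapshot to be taken between
         * vacuum_set_xid_limits() and heap_page_prune() - otherwise the
         * horizon can never move in the middle of pruning a page.
         */
        if (recent_xmin != last_xmin_at_horizon_calc)
        {
            recompute_horizons(state);
            return xid < state->maybe_needed;
        }
        return false;
    }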

Consider a page like:

lp 1: redirect to lp 3
lp 2: RECENTLY_DEAD tuple xmax x1, held back by session y's xmin horizon
lp 3: DELETE_IN_PROGRESS tuple xmax x2

and the following sequence of actions:

1) vacuum_set_xid_limits()
-> sets ComputeXidHorizonsResultLastXmin = RecentXmin
2) session y commits
3) vac_open_indexes()
-> AcceptInvalidationMessages()
-> InvalidateCatalogSnapshot()
-> RelationCacheInvalidateEntry()
-> GetCatalogSnapshot()
-> GetSnapshotData()
updates RecentXmin
4) x2 commits
5) heap_prune_chain(1)
-> sees lp3 as RECENTLY_DEAD (or DELETE_IN_PROGRESS depending on timing)
-> doesn't do anything
6) heap_prune_chain(2)
-> sees a RECENTLY_DEAD tuple, GlobalVisTestShouldUpdate() says yes,
updates the xid horizon to above x2
7) heap_prune_chain(3)
-> takes the heap-only path at the beginning, sees a DEAD tuple, marks the
line pointer as unused

Boom.
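
In case it helps to see the end state spelled out, here's a toy standalone
program that just replays the decisions from steps 5-7 against a three-item
"page" - all types are made up, this is obviously not the real
heap_prune_chain():

    #include <stdio.h>

    /* toy versions of the item pointer states */
    typedef enum { LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD } LpFlag;

    typedef struct
    {
        LpFlag flag;
        int    target;      /* for LP_REDIRECT: item number it points to */
    } Item;

    int
    main(void)
    {
        /* the page from above; items are 1-based, slot 0 unused */
        Item page[4] = {
            {LP_UNUSED, 0},
            {LP_REDIRECT, 3},   /* lp1: redirect to lp3 */
            {LP_NORMAL, 0},     /* lp2: RECENTLY_DEAD, xmax x1 */
            {LP_NORMAL, 0},     /* lp3: DELETE_IN_PROGRESS, xmax x2 */
        };

        /*
         * 5) heap_prune_chain(1) follows the redirect, sees lp3 as
         *    RECENTLY_DEAD / DELETE_IN_PROGRESS, changes nothing.
         * 6) heap_prune_chain(2) sees a RECENTLY_DEAD tuple, and the horizon
         *    gets recomputed to above x2.
         * 7) heap_prune_chain(3) takes the heap-only path, now sees lp3 as
         *    DEAD, and marks it unused - without touching the redirect in lp1.
         */
        page[3].flag = LP_UNUSED;

        if (page[1].flag == LP_REDIRECT && page[page[1].target].flag == LP_UNUSED)
            printf("broken: lp1 redirects to lp%d, which is now LP_UNUSED\n",
                   page[1].target);
        return 0;
    }

Something independent can now be inserted into that unused slot, at which
point the redirect in lp1 points at an unrelated tuple.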

Not sure if it's worth it, but I think this can be made into a reliable
isolation test by having a table's index locked exclusively, blocking
vac_open_indexes(), which can then reliably schedule a new relcache inval,
which in turn can compute a new RecentXmin. Might be too fragile to be worth
it.

As an independent issue, I don't think it can ever be safe to do catalog
access with PROC_IN_VACUUM set - which vac_open_indexes() (and likely some
other things) clearly does. Without the current backend's xmin set, another
backend can remove still-required tuples, which can cause us to build corrupt
relcache entries and then subsequently skip indexing some indexes, causing
corruption. I think I hit this while trying to repro the above, but it's hard
to be sure.
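
To spell out why (simplified standalone sketch, not the actual
ComputeXidHorizons() code - the types and the flag value are made up): the
horizon computation skips backends that have PROC_IN_VACUUM set, so such a
backend's xmin - which it doesn't advertise anyway - protects nothing:

    #include <stdint.h>

    typedef uint32_t Xid;

    #define InvalidXid      0
    #define PROC_IN_VACUUM  0x02    /* made-up value, for illustration */

    typedef struct
    {
        uint8_t status_flags;
        Xid     xmin;       /* InvalidXid if no snapshot is advertised */
    } Backend;

    /*
     * Toy model of computing the oldest non-removable xid: backends running
     * VACUUM are assumed not to look at anything but the table being
     * vacuumed, so their xmin is ignored.  That assumption breaks as soon as
     * such a backend does catalog access, as vac_open_indexes() does -
     * nothing then prevents another backend from pruning the catalog tuples
     * it is reading.
     */
    Xid
    oldest_nonremovable_xid(const Backend *backends, int n, Xid latest_completed)
    {
        Xid horizon = latest_completed + 1;

        for (int i = 0; i < n; i++)
        {
            if (backends[i].status_flags & PROC_IN_VACUUM)
                continue;
            if (backends[i].xmin != InvalidXid && backends[i].xmin < horizon)
                horizon = backends[i].xmin;
        }
        return horizon;
    }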

> I hit a crash once in 13 with a slightly evolved version of the test (many
> connections creating / dropping the partitions as in the original scenario,
> using :client_id to target different tables). It's possible that my
> instrumentation was the cause of that. Unfortunately it took quite a few hours
> to hit the problem in 13...

I wonder if this was actually the above PROC_IN_VACUUM thing.

Greetings,

Andres Freund
