Re: Catalog invalidations vs catalog scans vs ScanPgRelation()

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Catalog invalidations vs catalog scans vs ScanPgRelation()
Date: 2020-02-29 20:17:07
Message-ID: 20200229201707.jvmau4x5nof2dywx@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-02-28 22:10:52 -0800, Andres Freund wrote:
> So, um. What happens is that doDeletion() does a catalog scan, which
> sets a snapshot. The results of that catalog scan are then used to
> perform modifications. But at that point there's no guarantee that we
> still hold *any* snapshot, as e.g. invalidations can trigger the catalog
> snapshot being released.
>
> I don't see how that's safe. Without ->xmin preventing that,
> intermediate row versions that we did look up could just get vacuumed
> away, and replaced with a different row. That does seem like a serious
> issue?
>
> I think there's likely a lot of places that can hit this? What makes it
> safe for InvalidateCatalogSnapshot()->SnapshotResetXmin() to release
> ->xmin when there previously has been *any* catalog access? Because in
> contrast to normal table modifications, there's currently nothing at all
> forcing us to hold a snapshot between catalog lookups an their
> modifications?
>
> Am I missing something? Or is this a fairly significant hole in our
> arrangements?

I still think that's true. In a first iteration I hacked around the
problem by explicitly registering a catalog snapshot in
RemoveTempRelations(). That *sometimes* allows to get through the
regression tests without the assertions triggering.

But I don't think that's good enough (even if we fixed the other
potential crashes similarly). The only reason that avoids the asserts is
because in nearly all other cases there's also a user snapshot that's
pushed. But that pushed snapshot can have an xmin that's newer than the
catalog snapshot, which means we're still in danger of tids from catalog
scans being outdated.

My preliminary conclusion is that it's simply not safe to do
SnapshotResetXmin() from within InvalidateCatalogSnapshot(),
PopActiveSnapshot(), UnregisterSnapshotFromOwner() etc. Instead we need
to defer the SnapshotResetXmin() call until at least
CommitTransactionCommand()? Outside of that there ought (with exception
of multi-transaction commands, but they have to be careful anyway) to be
no "in progress" sequences of related catalog lookups/modifications.

Alternatively we could ensure that all catalog lookup/mod sequences
ensure that the first catalog snapshot is registered. But that seems
like a gargantuan task?

> The easiest way to fix this would probably be to have inval.c call a
> version of InvalidateCatalogSnapshot() that leaves the oldest catalog
> snapshot around, but sets up things so that GetCatalogSnapshot() will
> return a freshly taken snapshot? ISTM that pretty much every
> InvalidateCatalogSnapshot() call within a transaction needs that behaviour?

A related question is in which cases is it actually safe to use a
snapshot that's not registered, nor pushed as the active snapshot.
snapmgr.c just provides:

* Note that the return value may point at static storage that will be modified
* by future calls and by CommandCounterIncrement(). Callers should call
* RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
* used very long.

but doesn't clarify what 'very long' means. As far as I can tell,
there's very little that actually safe. It's probably ok to do a single
visiblity test, but anything that e.g. has a chance of accepting
invalidations is entirely unsafe?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2020-02-29 20:55:18 Re: SLRU statistics
Previous Message Justin Pryzby 2020-02-29 19:59:42 Re: vacuum verbose detail logs are unclear; log at *start* of each stage