Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Andrew Hammond <andrew(dot)george(dot)hammond(at)gmail(dot)com>
Subject: Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified
Date: 2011-10-26 03:24:33
Message-ID: 4177.1319599473@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Ordinarily, sending out sinval messages post-commit is okay because we
>> don't release locks until after that, and we suppose that our locks
>> prevent any other transactions from getting to the point of using
>> syscache entries that might have been invalidated by our transaction.
>> However, *we have carefully hacked on ANALYZE until it doesn't take any
>> locks that would block concurrent queries on the analyzed table.* So
>> the normal protection against stale syscache entries simply doesn't
>> work for pg_statistic fetches.

> This is very similar to one of the issues that reared its ugly head in
> regards to Simon's now-reverted patch to lower DDL locking strength.
> You identified some other issues there as well, but *one* of the
> issues was that, as in this case, the sinval mechanism fails to
> provide the necessary synchronization guarantees unless the lock
> required to reread the updated data conflicts with the lock required
> to change the data.

Right. We may take as little as AccessShareLock on a relation before
examining its pg_statistic entries, and ANALYZE isn't taking anything
that would block that.

> So I'm wondering if we ought to rethink our position that users of the
> sinval machinery must provide their own external synchronization
> through heavyweight locking, and instead build the synchronization
> into the sinval mechanism itself.

Yeah, it's starting to feel like we need a basic redesign of sinval
... although I'd not care to back-patch that, so we also need to think
of a sane solution for the back branches.

> If we sent out sinval messages just before removing ourselves from the
> ProcArray, I think that would more-or-less fix this bug (although
> maybe I'm missing some reason why it's not practical to send them that
> early) except that I don't see any way to handle the sinval-reset
> case, which seems to more or less kill this idea in its tracks.

The other reason that doesn't work is there's a race condition: someone
might load their cache entry immediately after the sinval message went
past, but before the updating transaction commits.

> But maybe there's some other mechanism whereby we could combine
> sending the sinval messages slightly earlier (before
> ProcArrayEndTransaction) with blocking anyone who processes those
> messages until after the committing backend finishes
> ProcArrayEndTransaction. For example, you could add an additional
> LWLock, which has to be held in exclusive mode by a committing
> transaction that sends any sinval messages.

Doesn't sound very scalable :-(.

Even given your recent changes to reduce the overhead of checking for
sinval messages, I'm not sure that it'd be practical to move the sinval
message processing to just-before-we-look-up-a-cache-entry. Right now,
we do AcceptInvalidationMessages basically once per table per query
(or maybe it's twice or so, but anyway a very small multiplier on that).
If we try to do it every time through SearchSysCache, we are probably
talking two to three orders of magnitude more checks, which ISTM is
certain to push the sinval queue back up to the top of the heap for
contention.

But in any case, this isn't the core of the problem. The real point
here is that we need a guarantee that a syscache entry we're going to
use is/was valid as of some suitable time point later than the start of
our current transaction. (Once we have taken a snapshot, VACUUM will
know that it can't remove any tuples that were deleted after the time of
that snapshot; so even for SnapshotNow fetches, it's important to have
an MVCC snapshot to protect toast-table dereferences.) Perhaps rather
than tying the problem into SearchSysCache, we should attach the
overhead to GetTransactionSnapshot, which is called appealingly few
times per query. But right offhand it seems like that only protects us
against the toast-tuple-deletion problem, not against the more general
one of getting a stale view of the status of some relation.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2011-10-26 06:53:46 Re: Range Types - typo + NULL string constructor
Previous Message Robert Haas 2011-10-26 02:33:06 Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified