Re: brin index vacuum versus transaction snapshots

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: brin index vacuum versus transaction snapshots
Date: 2015-07-31 17:57:00
Message-ID: 20150731175700.GX2441@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Let me explain a bit more what's the mechanism at play.

When scanning a block range to summarize it, we use an MVCC snapshot.
Any tuples that are inserted by transactions that appear as in-progress
to that snapshot will not be seen by that range scan; therefore we need
another mechanism to include their values in the summary tuple.

This mechanism is a "placeholder tuple." We insert this tuple into the
index; any process that tries to insert into the range will update the
placeholder tuple. All insertions that are not visible according to the
MVCC scan must update the placeholder tuple.

So there are two events that are time-critical regarding each other: the
insertion of the placeholder tuple, and the acquisition of the snapshot
for the MVCC scan. All transactions in progress for the snapshot *must*
see the placeholder tuple; therefore, we make sure we insert the
placeholder tuple first, and acquire the snapshot later.

This is all regardless of a transaction being a user-declared
transaction block, or an implicit transaction opened by some command;
regardless of vacuum being user-invoked or autovacuum. The description
of the above is critical for correctness in all those cases.

I assumed (wrongly) that vacuum would not acquire a snapshot. (For one
thing, we already rely for correctness on lazy vacuum not holding xmin
back, so this is not entirely without sense.) The problem is that
PortalRunUtility does it inconditionally. This is not a problem in
read-committed mode, because we simply freshen the snapshot each time
and all is well. In transaction-snapshot mode, taking another snapshot
is a no-op and thus the whole thing is broken. The code is right to
raise an error. (I don't really give a damn whether the error message
is 100% correct or not. Fixing that part is trivial.)

Even assuming that PortalRunUtility would not grab a snapshot at start,
things would still be broken if vacuum processed more than one range in
transaction-snapshot isolation mode: insert placeholder tuple, grab
snapshot for the first range, scan range: all is well up to this point.
Then we need to process the second range: insert placeholder tuple, grab
snapshot ... which reuses the previous snapshot, older than the
placeholder tuple. Oops.

I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot(). As far
as I can see, that should completely close the hole. This requires
patching IndexBuildHeapRangeScan() to allow for that.

Untested patch attached.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
brin-snapshot.patch text/x-diff 4.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jesper Pedersen 2015-07-31 18:10:01 Re: RequestAddinLWLocks(int n)
Previous Message Robert Haas 2015-07-31 17:43:45 Re: patch: prevent user from setting wal_buffers over 2GB bytes