Quick Links

Re: brin index vacuum versus transaction snapshots

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: brin index vacuum versus transaction snapshots
Date:	2015-07-31 17:57:00
Message-ID:	20150731175700.GX2441@postgresql.org
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Let me explain a bit more what's the mechanism at play.

When scanning a block range to summarize it, we use an MVCC snapshot.
Any tuples that are inserted by transactions that appear as in-progress
to that snapshot will not be seen by that range scan; therefore we need
another mechanism to include their values in the summary tuple.

This mechanism is a "placeholder tuple." We insert this tuple into the
index; any process that tries to insert into the range will update the
placeholder tuple. All insertions that are not visible according to the
MVCC scan must update the placeholder tuple.

So there are two events that are time-critical regarding each other: the
insertion of the placeholder tuple, and the acquisition of the snapshot
for the MVCC scan. All transactions in progress for the snapshot *must*
see the placeholder tuple; therefore, we make sure we insert the
placeholder tuple first, and acquire the snapshot later.

This is all regardless of a transaction being a user-declared
transaction block, or an implicit transaction opened by some command;
regardless of vacuum being user-invoked or autovacuum. The description
of the above is critical for correctness in all those cases.

I assumed (wrongly) that vacuum would not acquire a snapshot. (For one
thing, we already rely for correctness on lazy vacuum not holding xmin
back, so this is not entirely without sense.) The problem is that
PortalRunUtility does it inconditionally. This is not a problem in
read-committed mode, because we simply freshen the snapshot each time
and all is well. In transaction-snapshot mode, taking another snapshot
is a no-op and thus the whole thing is broken. The code is right to
raise an error. (I don't really give a damn whether the error message
is 100% correct or not. Fixing that part is trivial.)

Even assuming that PortalRunUtility would not grab a snapshot at start,
things would still be broken if vacuum processed more than one range in
transaction-snapshot isolation mode: insert placeholder tuple, grab
snapshot for the first range, scan range: all is well up to this point.
Then we need to process the second range: insert placeholder tuple, grab
snapshot ... which reuses the previous snapshot, older than the
placeholder tuple. Oops.

I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot(). As far
as I can see, that should completely close the hole. This requires
patching IndexBuildHeapRangeScan() to allow for that.

Untested patch attached.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment	Content-Type	Size
brin-snapshot.patch	text/x-diff	4.7 KB

In response to

Re: brin index vacuum versus transaction snapshots at 2015-07-31 14:30:52 from Kevin Grittner

Responses

Re: brin index vacuum versus transaction snapshots at 2015-07-31 19:45:24 from Alvaro Herrera
Re: brin index vacuum versus transaction snapshots at 2015-08-01 17:40:24 from Kevin Grittner

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Jesper Pedersen	2015-07-31 18:10:01	Re: RequestAddinLWLocks(int n)
Previous Message	Robert Haas	2015-07-31 17:43:45	Re: patch: prevent user from setting wal_buffers over 2GB bytes