Transaction Snapshots and Hot Standby

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Transaction Snapshots and Hot Standby
Date: 2008-09-11 00:07:55
Message-ID: 1221091675.3913.806.camel@ebony.2ndQuadrant

Transaction snapshots are probably the most difficult problem for Hot
Standby to resolve. We *must* have a transaction snapshot to access
table data in a consistent way, or we must accept some inconsistency,
though that is not simple either.

We can derive transaction snapshots

* remotely from primary node
* locally on the standby node

If we derive a snapshot locally, we will end up with a situation
where the xmin of the local snapshot precedes the xmin of the primary
node. When this occurs it becomes possible for WAL records to
arrive on the standby that request removal of rows a transaction
might still wish to see. That situation can be prevented either by
deferring WAL apply or by cancelling queries.
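
For illustration only, the replay-time check might look something
like this. Every name here apart from TransactionIdFollowsOrEquals()
is hypothetical; latest_removed_xid stands for the newest xid whose
row versions a cleanup record removes:

    /* Hypothetical sketch of the conflict check during WAL apply */
    if (TransactionIdFollowsOrEquals(latest_removed_xid,
                                     oldest_standby_xmin))
    {
        /* some standby snapshot could still see the removed rows */
        if (defer_wal_apply)
            pause_replay_until_snapshot_release();   /* option 1 */
        else
            cancel_conflicting_queries(latest_removed_xid); /* option 2 */
    }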

We can defer WAL apply for particular tables only, but this could
significantly complicate the apply process and is not a suggested option
for the first release of this feature.

We might control query cancellation by tracking which tables have had
rows removed that would have been visible to particular queries. Again
possible, but it suffers from the problem that tables on which HOT is
frequently active would be almost unusable. So it is not a general
solution.

Simply ignoring WAL removal has been discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php
Explicitly defining the tables a transaction wishes to see has also been
discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php

So the most generally applicable and simplest initial solution for
generating snapshots is to take them from the remote primary node. The
proposal for this follows:

Transaction snapshots will be managed by a new process, Snapproc.
Snapproc will start only in recovery mode and will exit when normal
processing mode begins. The snapshot process will connect to the remote
primary node, export snapshot data and copy it to shared memory on
the standby node. When all standby backends have run
UnregisterSnapshot(), the snapshot will be unregistered on the
remote primary node as well.
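
A rough outline of that lifecycle, with every function name invented
for illustration:

    /* Sketch of the proposed snapshot process; none of these
     * functions exist yet. */
    snap = export_snapshot_from_primary(conn);  /* remote request */
    publish_snapshot_in_shmem(snap);            /* backends can use it */
    ...
    /* later, once every standby backend has run UnregisterSnapshot() */
    if (shmem_snapshot_refcount(snap) == 0)
        unregister_snapshot_on_primary(conn, snap);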

The standby must not think a transaction is visible until all changes
made by it have been applied. So snapshots from the primary cannot
be used on the standby until the LSN at which they were taken has been
reached by WAL apply on the standby. Snapshots don't normally carry an
LSN, so we must derive that information as well when we take a snapshot.
With asynchronous commits a transaction can be committed but not yet
written to disk, so we cannot rely on the current WAL write pointer.
Taking the WAL insert pointer is the safest thing to do, though it is
the most heavily contended. We don't want to hold ProcArrayLock while
requesting WALInsertLock, so we will get the LSN of the WAL insert
pointer *after* the snapshot has been derived (the exact position
doesn't matter much, as long as it's not before the snapshot).
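
A sketch of that ordering; GetInsertRecPtr() and XLByteLT() are meant
in the spirit of the existing xlog APIs, the rest is invented:

    /* On the primary: derive the snapshot first, then read the
     * WAL insert pointer, so the LSN is never before the snapshot. */
    snap = derive_snapshot();
    snap_lsn = GetInsertRecPtr();

    /* On the standby: the snapshot matures only once replay has
     * passed that LSN. */
    while (XLByteLT(last_replayed_lsn, snap_lsn))
        wait_for_wal_apply();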

So snapshots from the primary take time before they can be used. The
delay is exactly the current processing delay from primary to standby,
which is another reason why we do not wish to fall behind.

Taking snapshots from the primary has a few disadvantages:

* snapshots take time before they are usable
* requesting new snapshots is via remote request
* snapshots on primary prevent row removal (but this was also an
advantage of this technique!)

If the primary and standby nodes are connected by private Ethernet then
the snapshot request time will be ~10ms, though that is probably 100
times slower than current snapshot access. If primary and standby are
on opposite sides of an ocean then times could be as high as 500ms.
Taking snapshots directly could be annoyingly slow for small read-only
statements, so we
need to buffer this process in various ways. For larger queries, this
may not be a problem at all, but we do not wish to limit or assume the
uses of the standby node.

First of all, each backend will have the opportunity to reuse previous
snapshots both within transactions and across them. A userset parameter
snapshot_reuse_window = 0..60000ms will define the time window in which
any request for a new snapshot will simply result in being fed the last
snapshot again. When the window on a snapshot has expired a newer
snapshot will be presented instead. This idea is similar to serializable
transactions, which continually reuse the same snapshot. This is a
useful parameter for normal processing as well, since it will reduce
contention on the ProcArrayLock for many applications.
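
A minimal sketch of the reuse logic (names are hypothetical, and I am
assuming integer timestamps so the subtraction below is valid):

    /* Return the previous snapshot while it is still inside the
     * reuse window; otherwise hand out the newest matured one. */
    Snapshot
    GetStandbySnapshot(void)
    {
        TimestampTz now = GetCurrentTimestamp();

        /* snapshot_reuse_window is in ms, timestamps in microseconds */
        if (now - last_snap_time < snapshot_reuse_window * 1000)
            return last_snap;                   /* reuse */

        last_snap = most_recently_matured_snapshot();
        last_snap_time = now;
        return last_snap;
    }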

Snapshots can be reused across transactions in recovery mode, since they
are held in shared memory. The amount of shared memory dedicated to
storing snapshots will be max_connections * (maximum size of a
snapshot).
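
As a rough worked example (assuming each stored snapshot's xip array is
itself bounded by max_connections, with 4-byte xids): max_connections =
100 gives 100 snapshots of roughly 100 * 4 = 400 bytes each plus
headers, i.e. on the order of 40kB, a trivial amount of shared memory.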

Since there is a delay between obtaining a new snapshot and it becoming
usable, the snapshot process will buffer snapshots until they become
"mature", like a good whiskey. The snapshot process will take regular
snapshots and pre-age them so that when a backend requests a snapshot
it will be given the most recently matured snapshot. The time between
snapshots is set by snapshot_preread_timeout = 0..60000ms. If a request
for a snapshot arrives and there are no snapshots waiting to mature,
then this will trigger the snapshot process to request a new primary
snapshot. (This parameter could be set automatically based upon the
arrival rate of snapshot requests, but that is something to consider
later.)
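
Schematically, the snapshot process's buffering loop might look like
this (all names invented):

    /* Sketch of pre-aging: keep a queue of pending snapshots and
     * promote them once WAL apply passes their LSN. */
    for (;;)
    {
        if (pending_queue_empty() ||
            elapsed_ms() >= snapshot_preread_timeout)
            enqueue_pending(export_snapshot_from_primary(conn));

        while (!pending_queue_empty() &&
               replay_has_passed(pending_head_lsn()))
            promote_to_mature(dequeue_pending());   /* now usable */
    }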

If snapshot_reuse_window = 0 then a backend will be presented with a
freshly obtained snapshot and will then wait until the exact first
moment it can be used before returning.

We can continue to reuse snapshots from the primary even if the primary
crashes, becomes disconnected, or is shut down or restarted. New
snapshots
are obviously not possible until it appears again. It's not that common
for us to lose contact with the primary *and* for it to *not* be a
failover, so this seems like an acceptable restriction.

Other alternatives? Maybe.

In earlier discussions, I suggested that we might use "Read Uncommitted"
mode for use with Hot Standby. Tom pointed out that what I had suggested
was not "Read Uncommitted" as described by SQL Standard.

For Hot Standby, I've had a look at the various transaction isolation
modes possible to see which, if any, are desirable:

1. The first is to just accept that transaction *isolation* is not
possible, but carry on anyway. We might call this READ INCONSISTENT,
which carries the problems noted by Tom earlier. This mode lets us
"see" committed deletes because their row versions might be missing
from our query; it will cause updated records to disappear from queries
while, at the same time, newly inserted data stays invisible. However,
if the table is insert-only or read-only this will give a consistent
result, so this mode *is* appealing to many potential users. (It's been
suggested to me twice, independently.) We would need to derive a
snapshot based upon which transactions are "running" according to the
replay of transactions during recovery.
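
A sketch of the bookkeeping that recovery-local snapshot would need,
maintained purely from the WAL stream (accessors are invented;
TransactionIdIsValid() is real):

    /* Track which xids look "running" from the records we replay. */
    xid = record_xid(record);               /* illustrative accessor */

    if (is_commit_or_abort(record))
        known_running_remove(xid);          /* transaction has ended */
    else if (TransactionIdIsValid(xid))
        known_running_add(xid);             /* first sign of life */

    /* A local snapshot then uses the known_running set as its xip[]. */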

2. The second is to throw an error if transaction isolation is violated.
Currently we can only do this approximately, by recording the global LSN
at the start of a query and then aborting the query if we touch a data
block that has changes made after that LSN. This is a slightly more
extreme version of (1), but always consistent if it completes. (Somebody
has joked that we could make the error message "snapshot too old".)
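
A sketch of that approximate check, using the real PageGetLSN() macro
but otherwise invented names:

    /* query_start_lsn was recorded when the query began; abort if
     * we touch a block modified after that point. */
    if (XLByteLT(query_start_lsn, PageGetLSN(page)))
        ereport(ERROR,
                (errmsg("snapshot too old")));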

We can implement both of those with an additional parameter:

    allow_xmin_advance = off (default) | on

Mode (1) is allow_xmin_advance = on in READ COMMITTED mode.
Mode (2) is allow_xmin_advance = on in SERIALIZABLE mode.
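
Illustrative usage, assuming the proposed parameter existed:

    SET allow_xmin_advance = on;
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;  -- mode (1)
    ...
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;    -- mode (2)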

3. We can also implement Read Uncommitted mode itself in the database.
We would get inconsistent results for updates and deletes *and* inserts
but at least we would always see the latest version of an updated row,
rather than skipping them completely. On a read only table this might be
very useful. On an insert only table this might result in some long
running queries as a scan struggles to keep up with inserts!

For (3) we would need to retrieve tuples from executor scan nodes in
SnapshotDirty mode. If we see a tuple we would follow its tuple chain to
another row on the same block using EvalPlanQual() logic. If we are
using an IndexScan and the update chain goes off-block then there will
be another index pointer that we (may) see to access the tuple, so we
would stop at that point. If we are using a SeqScan we should follow the
tuple chain until we see the top tuple, even if it goes off-block.
Bitmap scans would cause problems.
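
Roughly, the chain-following might look like this fragment (helpers
are invented; t_ctid is the real successor pointer in the tuple
header):

    /* Follow the update chain from a dirty-fetched tuple toward the
     * newest version, in the style of EvalPlanQual(). */
    while (tuple_is_updated(tuple))
    {
        ItemPointerData next = tuple->t_data->t_ctid;

        if (scan_is_indexscan && !same_block(&tuple->t_self, &next))
            break;      /* the index will reach the successor itself */

        tuple = fetch_tuple(relation, &next);
    }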

Deriving snapshots from the master seems the most obvious, but let's see
where the dice roll...

Other topics on Hot Standby will be covered later; this is just the
first part.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
