Hot Standby: Caches and Locks

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Hot Standby: Caches and Locks
Date: 2008-10-21 14:06:20
Message-ID: 1224597980.27145.90.camel@ebony.2ndQuadrant


Next stage is handling locks and proc interactions. While this has been
on Wiki for a while, I have made a few more improvements, so please read
again now.

Summary of Proposed Changes
---------------------------

* New RMgr using rmid==8 => RM_RELATION_ID (which fills last gap)
* Write new WAL message, XLOG_RELATION_INVAL, immediately prior to commit
* LockAcquire() writes a new WAL message, XLOG_RELATION_LOCK
* Startup process queues sinval message when it sees XLOG_RELATION_INVAL
* Startup process takes and holds AccessExclusiveLock when it processes
XLOG_RELATION_LOCK message
* At xact_commit_redo we fire sinval messages and then release locks for
that transaction
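
For illustration only, the new resource manager and its two record
types might be declared roughly as below; the rmid value comes from
the list above, the info values are placeholders:

/* Sketch only -- everything except the rmid choice is illustrative */
#define RM_RELATION_ID          8       /* fills the last free slot in rmgr.h */

/* record types for the new rmgr (stored in the low bits of xl_info) */
#define XLOG_RELATION_INVAL     0x00    /* sinval messages, logged just before commit */
#define XLOG_RELATION_LOCK      0x10    /* an AccessExclusiveLock was acquired */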

Explanations
------------

All read-only transactions need to maintain various caches: relcache,
catcache and smgr cache. These caches will be maintained on each
backend normally, re-reading catalog tables when invalidation messages
are received.

Invalidation messages will be sent by the Startup process. The Startup
process will not maintain its own copy of the caches, so will never
receive messages, only send them. XLOG_RELATION_INVAL messages will be
sent immediately prior to commit (only), using a new function
LogCacheInval(), and also during EndNonTransactionalInvalidation(). We
do nothing at subtransaction commit. The WAL record will contain a
simple contiguous array of the SharedInvalidationMessage(s) that need
to be sent; if there is nothing to send, no WAL record is written.
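
To illustrate the record format (a sketch only: the struct layout and
LogCacheInval() itself are assumptions, while XLogInsert() and
XLogRecData are the existing APIs), the record body need only be a
count plus the message array:

/* Sketch only; assumes postgres.h plus the usual access/xlog.h and
 * storage/sinval.h includes. */
typedef struct xl_relation_inval
{
    int         nmsgs;                  /* number of messages */
    SharedInvalidationMessage msgs[1];  /* VARIABLE LENGTH ARRAY */
} xl_relation_inval;

/*
 * Hypothetical LogCacheInval(): called just before commit, and from
 * EndNonTransactionalInvalidation().  Writes nothing if there is
 * nothing to send.
 */
static void
LogCacheInval(SharedInvalidationMessage *msgs, int nmsgs)
{
    xl_relation_inval xlrec;
    XLogRecData rdata[2];

    if (nmsgs <= 0)
        return;                         /* no messages => no WAL record */

    xlrec.nmsgs = nmsgs;

    rdata[0].data = (char *) &xlrec;
    rdata[0].len = offsetof(xl_relation_inval, msgs);
    rdata[0].buffer = InvalidBuffer;
    rdata[0].next = &rdata[1];

    rdata[1].data = (char *) msgs;
    rdata[1].len = nmsgs * sizeof(SharedInvalidationMessage);
    rdata[1].buffer = InvalidBuffer;
    rdata[1].next = NULL;

    (void) XLogInsert(RM_RELATION_ID, XLOG_RELATION_INVAL, rdata);
}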

We can't send sinval messages after commit in case we crash and fail to
write WAL for them. We can't augment the commit/abort messages because
we must cater for non-transactional invalidations also, plus commit
xlrecs are already complex enough. So we log invalidations prior to
commit, queue them and then trigger the send at commit (if it happens).
We need do nothing in the abort case because we are not maintaining our
own caches in the Startup process. In the nontransactional invalidation
case we would process WAL records immediately.
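
A sketch of the redo side under the same assumptions (every helper
name below is a placeholder): messages are queued against the
record's xid when XLOG_RELATION_INVAL is replayed, and fired when the
matching commit record arrives.

/* Sketch only; helper names are placeholders. */
static void
relation_redo_inval(XLogRecPtr lsn, XLogRecord *record)
{
    xl_relation_inval *xlrec = (xl_relation_inval *) XLogRecGetData(record);

    /* Queue the messages under the record's xid; in the
     * nontransactional case they would instead be sent right away. */
    RecoveryQueueInvalMessages(record->xl_xid, xlrec->msgs, xlrec->nmsgs);
}

/* Called from xact_commit_redo: fire the queued sinval messages, then
 * release any AccessExclusiveLocks held on behalf of this transaction. */
static void
relation_redo_commit(TransactionId xid)
{
    RecoverySendQueuedInvalMessages(xid);
    ReleaseRecoveryLocks(xid);
}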

Startup process will need to initialise using SharedInvalBackendInit()
which is not normally executed by auxiliary processes. Startup would
call this from AuxiliaryProcessMain() just before we call StartupXLOG().
We will need an extra slot in state arrays to allow for Startup process.

Startup process needs to reset its sinval nextMsgNum so that everybody
thinks it has read all messages. It will be unprepared to handle catchup
requests if they were received for some reason, since only the Startup
process is sending messages at this point.
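
A minimal sketch of that initialisation, assuming the call sits in the
Startup-process branch of AuxiliaryProcessMain();
SharedInvalBackendInit() is the existing function, the catch-up marker
is a hypothetical helper:

/* Illustrative only.  The Startup process joins the sinval machinery
 * so that it can send messages, but it never reads any, so its read
 * position must always look fully caught up. */
static void
StartupProcessSInvalInit(void)
{
    SharedInvalBackendInit();           /* claim a procState[] slot */

    /* Hypothetical helper: advance our nextMsgNum to the queue head so
     * nobody ever sends us a catchup request. */
    SIMarkCaughtUp(MyBackendId);
}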

Startup process will continue to use XLogReadBuffer(), minimising the
changes required in current ResourceManager's _redo functions - there
are still some, see later. It also means that read-only backends will
use ReadBuffer() calls normally, so again, no changes required
throughout the normal executor code.

Locks will be taken by the Startup process when it receives a new WAL
message. XLOG_RELATION_LOCK messages will be sent each time a backend
*successfully* acquires an AccessExclusiveLock (only). We send it
immediately after the lock acquisition, which means we will often be
sending lock requests with no TransactionId assigned, so the slotId is
essential in tying up the lock request with the commit that later
releases it, since the commit does not include the vxid.
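
For illustration (the struct and function names are placeholders; the
call would sit in LockAcquire() right after a successful
AccessExclusiveLock acquisition), the record needs little more than
the lock identity plus enough to tie it to the eventual commit:

/* Sketch only; names are placeholders.  Assumes access/xact.h,
 * storage/backendid.h and access/xlog.h. */
typedef struct xl_relation_lock
{
    TransactionId   xid;        /* often InvalidTransactionId at this point */
    BackendId       slotId;     /* ties the lock to the commit that releases it */
    Oid             dbOid;
    Oid             relOid;
} xl_relation_lock;

/* Logged by the acquiring backend immediately *after* it has obtained
 * the AccessExclusiveLock. */
static void
LogAccessExclusiveLock(Oid dbOid, Oid relOid)
{
    xl_relation_lock xlrec;
    XLogRecData rdata;

    xlrec.xid = GetCurrentTransactionIdIfAny();
    xlrec.slotId = MyBackendId;
    xlrec.dbOid = dbOid;
    xlrec.relOid = relOid;

    rdata.data = (char *) &xlrec;
    rdata.len = sizeof(xl_relation_lock);
    rdata.buffer = InvalidBuffer;
    rdata.next = NULL;

    (void) XLogInsert(RM_RELATION_ID, XLOG_RELATION_LOCK, &rdata);
}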

In recovery, transactions will not be permitted to take any lock higher
than AccessShareLock on an object, nor assign a TransactionId. This
should also prevent us from writing WAL, but we protect against that
specifically as well, just in case. (Maybe we can relax that to Assert
sometime later). We can dirty data blocks but only to set hint bits.
(That's another reason to differentiate between those two cases anyway).
Note that in recovery, we will always be allowed to set hint bits - no
need to check for asynchronous commits. All other actions which cause
dirty data blocks should not be allowed, though this will be just an
Assert. Specifically, HOT pruning will not be allowed in recovery mode.
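
As a sketch of the lock-strength restriction (the function name is an
assumption, as is the choice of errcode; whatever test detects "we are
in recovery" is also still to be decided):

/* Illustrative only: rejects anything stronger than AccessShareLock
 * in a read-only backend running during recovery. */
static void
PreventStrongLockDuringRecovery(LOCKMODE lockmode)
{
    if (lockmode > AccessShareLock)
        ereport(ERROR,
                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
                 errmsg("cannot acquire locks stronger than AccessShareLock during recovery")));
}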

Since read-only backends will only be allowed to take AccessShareLocks
there will be no lock conflicts apart from with AccessExclusiveLocks.
(If we allowed higher levels of lock we would then need to maintain
multixacts to examine lock details, which we would also rather avoid).
So Startup process will not take, hold or release relation locks for any
purpose, *apart* from when AccessExclusiveLocks (AELs) are required. So
we will send WAL messages *only* for AELs.

The Startup process will emulate locking behaviour for transactions that
require AELs. AELs will be held by first inserting a dummy
TransactionLock entry into the lock table with the TransactionId of the
transaction that requests the lock. Then the lock entry will be made.
Locks will be released when processing a transaction commit, abort or
shutdown checkpoint message, and the lock table entry for the
transaction will be removed.
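
A rough sketch of the acquire-by-proxy path (LockAcquire() and
SET_LOCKTAG_RELATION() are the existing APIs; everything else here is
a placeholder name):

/* Illustrative only; assumes storage/lock.h. */
static void
relation_redo_lock(xl_relation_lock *xlrec)
{
    LOCKTAG     tag;

    /* Dummy transaction-lock entry, so the AEL has an owner that the
     * later commit/abort/shutdown-checkpoint redo can release. */
    RecoveryRegisterTransactionLock(xlrec->xid, xlrec->slotId);

    /* Now take the AccessExclusiveLock itself, in the Startup process. */
    SET_LOCKTAG_RELATION(tag, xlrec->dbOid, xlrec->relOid);
    (void) LockAcquire(&tag, AccessExclusiveLock, false, false);
}

/* Redo of commit, abort or shutdown checkpoint. */
static void
ReleaseRecoveryLocks(TransactionId xid)
{
    /*
     * Placeholder: look up the emulated AELs recorded for xid, call
     * LockRelease() on each, then remove the dummy transaction-lock
     * entry.
     */
}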

Any AEL request that conflicts with an existing lock will cause some
action: if it conflicts with an existing AEL then we issue a WARNING;
this should never happen, but if it does it indicates that the last
transaction died with a FATAL error without writing an abort record.
If the AEL request conflicts with a read-only backend then we wait for
a while (as discussed previously); if the conflict persists, the
read-only backend will receive a cancel message to make it go away.
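
Purely as a control-flow sketch of the above (every name here is a
placeholder, including the wait and cancel helpers):

/* Illustrative only. */
static void
ResolveRecoveryLockConflict(const LOCKTAG *tag)
{
    /* Conflict with an AEL we are already emulating: should not happen;
     * the previous holder probably died FATAL without an abort record. */
    if (RecoveryAELAlreadyHeld(tag))
    {
        ereport(WARNING,
                (errmsg("AccessExclusiveLock already held during recovery")));
        return;
    }

    /* Conflict with read-only backends: wait for a while, then cancel
     * whoever is still in the way. */
    if (!RecoveryWaitForLockRelease(tag))
        RecoveryCancelConflictingBackends(tag);
}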

If the Startup process crashes it is a PANIC anyway, so there are no
difficulties in cleanup for the lock manager with this approach.

The LOCK TABLE command by default applies an AccessExclusiveLock. This
will generate WAL messages when executed on the primary node. When
executed on the standby node the default will be to issue an
AccessShareLock. Any LOCK TABLE command that runs on the standby and
requests a specific lock type other than AccessShareLock will be
rejected.
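
A sketch of the LOCK TABLE check (the helper name and the "was the
mode written explicitly" flag are assumptions; the parser does not
currently distinguish an explicit ACCESS EXCLUSIVE from the default):

/* Illustrative only. */
static LOCKMODE
StandbyAdjustLockTableMode(LOCKMODE requested, bool mode_was_explicit)
{
    if (!mode_was_explicit)
        return AccessShareLock;     /* standby default, instead of AEL */

    if (requested != AccessShareLock)
        ereport(ERROR,
                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
                 errmsg("only ACCESS SHARE MODE can be requested with LOCK TABLE during recovery")));

    return requested;
}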

Note that it will not be possible to deadlock, since the Startup process
will receive only "already held" lock requests, and the query backends
will not be allowed to request locks that could cause deadlocks. This is
important because the Startup process should never die because of a
deadlock; it should always be the "other guy", else we should probably
PANIC. Advisory locks seem to be a problem here. My initial thought is
to just prevent them from working during Hot Standby. We may relax that
restriction
in a later release.

Code for sinvaladt message handling needs little change. It is already
generalised to allow any process to put messages onto the queue without
keeping state on a per-backend basis for those messages.

Code for locks messages needs to be generalised to allow the Startup
process to request locks by proxy for the transactions it is emulating.
The majority of the refactoring will occur here. Fiddly, but no problems
foreseen.

Have I missed anything? Would anybody like more details anywhere?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
