Re: Reducing ClogControlLock contention

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reducing ClogControlLock contention
Date: 2015-08-25 08:51:10
Message-ID: CANP8+jLJ2nX+O7AbxF-vYHYuxuX91DXRO2_P9cyrPVFV=tBrrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 22 August 2015 at 15:14, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2015-06-30 08:02:25 +0100, Simon Riggs wrote:
> > Proposal for improving this is to acquire the ClogControlLock in Shared
> > mode, if possible.
>
> I find that rather scary. That requires for all read and write paths in
> clog.c and slru.c to only ever read memory locations once. Otherwise
> those reads may not get the same results. That'd mean we'd need to add
> indirections via volatile to all reads/writes. It also requires that we
> never read in 4 byte quantities.

There is is a very specific case in which this is possible. We already
allow writes to data structures in the transaction manager without locks
*in certain cases* and this is all that is proposed here. Nothing scary
about doing something we already do, as long as we get it right.

> > This is safe because people checking visibility of an xid must always run
> > TransactionIdIsInProgress() first to avoid race conditions, which will
> > always return true for the transaction we are currently committing. As a
> > result, we never get concurrent access to the same bits in clog, which
> > would require a barrier.
>
> I don't think that's really sufficient. There's a bunch of callers do
> lookups without such checks, e.g. in heapam.c. It might be safe due to
> other circumstances, but at the very least this is a significant but
> sublte API revision.

I've checked the call sites you mention and they refer to tests made AFTER
we know have waited for the xid to complete via the lock manager. So as of
now, there are no callers of TransactionIdGetStatus() that have not already
confirmed that the xid is no longer active in the lock manager or the
procarray. Since we set clog before touching lock manager or procarray we
can be certain there is no concurrent reads and writes.

> TransactionIdSetPageStatus() calls TransactionIdSetStatusBit(), which
> writes an 8 byte variable (the lsn). That's not safe.
>

Agreed, thanks for spotting that.

I see this as the biggest problem to overcome with this patch.

We write WAL in pages, so we don't need to store the low order bits (varies
according to WAL page size). We seldom use the higher order bits, since it
takes a while to go thru (8192 * INT_MAX) = 32PB of WAL. So it seems like
we can have a base LSN for a whole clog page, plus 4-byte LSN offsets. That
way we can make the reads and writes atomic on all platforms. All of that
can be hidden in clog.c in macros, so low impact, modular code.

> A patch like this will need far more changes. Every read and write from
> a page has to be guaranteed to only be done once, otherwise you can get
> 'effectively torn' values.
>
> That is, you can't just do
> static void
> TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr
> lsn, int slotno)
> ...
> char *byteptr;
> char byteval;
> ...
> /* note this assumes exclusive access to the clog page */
> byteval = *byteptr;
> byteval &= ~(((1 << CLOG_BITS_PER_XACT) - 1) << bshift);
> byteval |= (status << bshift);
> *byteptr = byteval;
> ...
>
> the compiler is entirely free to "optimize" away the byteval variable
> and do all these masking operations on memory! It can intermittenly
> write temporary values to byteval, because e.g. the register pressure is
> too high.
>
> To actually rely on single-copy-atomicity you have to enforce that these
> reads/writes can only happen. Leavout out some possible macro
> indirection that'd have to look like
> byteval = (volatile char *) byteptr;
> ...
> *(volatile char *) byteptr = byteval;
> some for TransactionIdGetStatus(), without the write side obviously.
>

Seems doable.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2015-08-25 09:05:44 Re: Commitfest remaining "Needs Review" items
Previous Message Heikki Linnakangas 2015-08-25 08:51:05 Re: Commitfest remaining "Needs Review" items