Re: 16-bit page checksums for 9.2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-22 13:32:04
Message-ID: CA+Tgmob3jaJXJLaH6g9h51WLwPgmTC6Wu=wXibqUd=UALQbDaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 21, 2012 at 5:07 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> We do, in numerous places, drop a shared buffer content lock and reacquire it
> in exclusive mode.  However, the existing users of that pattern tend to trade
> the lock, complete subsequent work, and unlock the buffer all within the same
> function.  So they must, because several of them recheck facts that can change
> during the period of holding no lock.  SetBufferCommitInfoNeedsSave() callers
> do not adhere to that pattern; they can be quite distant from the original
> lock acquisition and eventual lock release.  Take the prototypical case of
> SetHintBits().  Our shared lock originated in a place like heapgettup(), which
> expects to hold it continually until finished.  We would need to somehow push
> up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
> lock, then have heapgettup() reorient itself as needed.  Every caller of code
> that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
> Perhaps there's a different way to apply the lock-trade technique that avoids
> this mess, but I'm not seeing it.  Consequently, I find the idea of requiring
> a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
> to be a sensible one.  Within that umbrella, some details need attention:

Fair analysis.

> - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
>  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
>  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
>  public API, perhaps LockBufferForLSN().

That seems like a good idea. I am a little worried about Simon's plan
to do the additional locking only for AMs that have no unlogged-type
updates. It's a reasonable performance optimization, and skipping the
locking when checksums are shut off also seems reasonable, but it
seems a bit fragile: suppose that, in the future, someone fixes GiST
to do something intelligent with kill_prior_tuple. Now suddenly it
needs LSN locking that it didn't need before, but this fact might not
be very obvious. It might be a good idea to design LockBufferForLSN
to take an AM OID as an argument, and use the AM OID to decide whether
the locking is required. That way, we can call it from everywhere
that reads an LSN, and it can simply skip the locking in places where
it isn't presently needed.

Or maybe there's a better design, but I agree with you that some kind
of public API is essential.

> - The use of some spinlock need not imply using the buffer header spinlock.
>  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
>  trade-off of splitting a lock: less contention at the cost of more
>  acquisitions.  I have no intuition on which approach would perform better.

I think I like the idea of a separate lock, but I agree it could stand
to be tested both ways. Another thought is that we might add
SpinLockConditionalAcquire(), that just tries to TAS the lock and
returns false if the lock is already held. Then, we could grab the
spinlock while writing the page, in lieu of copying it, and anyone who
wants to set a hint bit can conditionally acquire the spinlock long
enough to set the bits. If the spinlock isn't immediately free then
instead of spinning they just don't set the hint bits after all. That
way you can be sure that no hint bits are set during the write, but
any attempt to set a hint bit only costs you a single TAS - no loop,
as with a regular spinlock acquisition - and of course the hint
itself.

> A possible compromise is to leave the page clean after setting a hint bit,
> much like the patch already has us do under hot standby.  Then there's no new
> WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
> benchmarked when he was looking at hint bits?  Does anyone recall the result?

It slows things down substantially in the case of repeated scans of
the same unchaning table. In fact, I tried a much more nuanced
approach: set BM_UNTIDY on the page when
SetBufferCommitInfoNeedsSave() is called. BM_UNTIDY pages are written
by the background writer and during checkpoints, and 5% of the time by
backends. All of that has the salutary effect of making the first
sequential scan less ungodly slow, but it also has the less-desirable
effect of making the next 19 of them still kinda-sorta slow. It was
unclear to me (and perhaps to others) whether it really worked out to
a win.

However, we might need/want to revisit some of that logic in
connection with this patch, because it seems to me that a large
sequential scan of an unhinted table could be ferociously slow, with
the backend writing out its own pages and WAL-logging them as it goes.
It would be good to test that to figure out whether some mitigation
measures are needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-02-22 13:33:07 Re: VACUUM ANALYZE is faster than ANALYZE?
Previous Message Euler Taveira de Oliveira 2012-02-22 13:14:08 Re: temporal algebra and data type