Page-at-a-time Locking Considerations

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Page-at-a-time Locking Considerations
Date: 2008-02-04 16:04:43
Lists: pgsql-hackers

In heapgetpage() we hold the buffer locked while we look for visible
tuples. That works well in most cases since the visibility check is fast
if the tuple's status (hint) bits are set. If they are not set we have
to do things like scan the snapshot and confirm things via clog lookups.
All of that takes time and can lead to long buffer lock times, possibly
across multiple I/Os in the very worst cases.

This doesn't just happen for old transactions. Accessing very recent
TransactionIds is prone to rare but long waits when we ExtendCLOG().

Such problems are numerically rare, but the buffers with long lock times
are also the ones that have concurrent or at least recent write
operations on them. So all SeqScans have the potential to induce long
wait times for write transactions, even if they are scans on
single-block tables. Tables with heavy write activity from multiple
backends have their work spread across multiple blocks, so a SeqScan
will meet this issue at each current insertion point in the table,
which greatly increases the chances of it occurring.

It seems possible to just memcpy() the whole block away and then drop
the lock quickly. That gives a consistent lock time in all cases and
allows us to do the visibility checks in our own time. It might seem
that we would end up copying irrelevant data, which is true. But the
greatest cost is memory access time. If hardware memory prefetch kicks
in we will find that the memory is retrieved en masse anyway; if it
doesn't we will have to wait for each cache line. So the best case is
actually an en masse retrieval of cache lines, in the common case where
blocks are fairly full (the cutoff is vague and depends on the exact
mechanism of hardware/compiler induced memory prefetch).

The copied block would be used only for visibility checks. The main
buffer would retain its pin and we would pass references to the block
through the executor as normal. So this would be a change completely
isolated to heapgetpage().
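
To make the copy-aside idea concrete, here is a minimal standalone
sketch. It is not PostgreSQL code: DemoBuffer, DemoPage, DemoTuple and
every demo_* function are hypothetical names, an atomic_flag spinlock
stands in for the buffer content lock, and the visibility rule is a toy
comparison against a single snapshot value rather than a real snapshot
scan plus clog lookup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_TUPLES 64      /* assumed per-page tuple limit for the demo */

/* Hypothetical, heavily simplified tuple header. */
typedef struct {
    uint32_t xmin;              /* inserting transaction id */
    uint32_t xmax;              /* deleting transaction id, 0 if none */
} DemoTuple;

/* Hypothetical page layout; a real heap page looks nothing like this. */
typedef struct {
    uint16_t  ntuples;
    DemoTuple tuples[DEMO_MAX_TUPLES];
} DemoPage;

/* Stand-in for a shared buffer: a content lock plus the page itself. */
typedef struct {
    atomic_flag content_lock;   /* toy spinlock, not an LWLock */
    DemoPage    page;
} DemoBuffer;

static void demo_lock(DemoBuffer *buf)
{
    while (atomic_flag_test_and_set_explicit(&buf->content_lock,
                                             memory_order_acquire))
        ;                       /* spin until the lock is free */
}

static void demo_unlock(DemoBuffer *buf)
{
    atomic_flag_clear_explicit(&buf->content_lock, memory_order_release);
}

/*
 * Toy visibility rule: visible if inserted before the snapshot and not
 * deleted before it. In the real system this is the step that may be
 * slow, which is why we want it outside the lock.
 */
static bool demo_tuple_visible(const DemoTuple *tup, uint32_t snapshot_xmax)
{
    bool inserted_before = tup->xmin < snapshot_xmax;
    bool deleted_before  = tup->xmax != 0 && tup->xmax < snapshot_xmax;

    return inserted_before && !deleted_before;
}

/*
 * Copy-aside scan: hold the lock only for the memcpy(), then run the
 * visibility checks against the private copy. Returns the number of
 * visible tuples and stores their positions in visible_out, which must
 * have room for DEMO_MAX_TUPLES entries.
 */
static int demo_scan_page_copy_aside(DemoBuffer *buf, uint32_t snapshot_xmax,
                                     uint16_t *visible_out)
{
    DemoPage local;
    int      nvisible = 0;

    demo_lock(buf);
    memcpy(&local, &buf->page, sizeof(local));
    demo_unlock(buf);           /* lock hold time is now one memcpy() */

    assert(local.ntuples <= DEMO_MAX_TUPLES);
    for (uint16_t i = 0; i < local.ntuples; i++)
    {
        if (demo_tuple_visible(&local.tuples[i], snapshot_xmax))
            visible_out[nvisible++] = i;
    }
    return nvisible;
}
```

The positions returned refer to the copy, but they index the same slots
in the shared page, so a caller that still holds its pin could hand out
references into the original buffer as described above.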

Was the copy-aside method considered when we introduced page-at-a-time
mode? Any reasons to think it would be dangerous or infeasible? If not,
I'll give it a bash and get some test results.

Simon Riggs

