Re: Avoiding repeated snapshot computation

From: Andres Freund <andres(at)anarazel(dot)de>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, singh(dot)gurjeet(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-29 20:55:52
Message-ID: 201111292155.52460.andres@anarazel.de
Lists: pgsql-hackers

On Tuesday, November 29, 2011 05:51:40 AM Pavan Deolasee wrote:
> On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner
>
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> > Andres Freund wrote:
> >> I would like to see somebody running a benchmark on a machine with
> >> higher concurrency...
> >
> > Yeah, me too. I don't find it at all hard to believe that we will
> > see significant performance benefit by changing the PGPROC structure
> > so that all parts of it can be accessed through one cache line rather
> > than two. The fact that we saw improvement when changing it down to
> > two, even though there is extra indirection in some places, seems
> > like pretty solid evidence on the face of it.
>
> I think there is fundamental difference between the PGPROC patch that
> we did and the rearranging SnapshotData members that Andres has
> proposed. The PGPROC structure is a shared memory area, very heavily
> accessed by GetSnapshotData (and some others). There was a linear
> increase in the access as number of clients go up. The SnapshotData
> structure is mostly from a backend local memory (unless we do
> something what I suggested in the OP) and hence fitting that in a
> single cache line may or may not have much impact. We don't even
> guarantee that it starts at a cacheline which means in most cases it
> will still be spread on multiple cache lines.
Well, I could measure a ~1.3% benefit on a modestly concurrent machine (see
the earlier reformatted mail from Gurjeet), and the benefits were definitely
smaller without concurrency. But I agree - the benefits won't be even remotely
as big as with the PGPROC split-up patch.
You're also right about the alignment not being predictable enough.

Perhaps pg should start using/introducing a force_cacheline_align macro which
can easily be implemented on most compilers? Especially for SnapshotData and
friends, which are aligned statically, that should be good enough.

Something along the lines of:

#if defined(__GNUC__)
#define force_align_to(alignment) __attribute__((aligned(alignment)))
#elif defined(_MSC_VER)
#define force_align_to(alignment) __declspec(align(alignment))
#else
#define force_align_to(alignment)	/* unknown compiler: no-op */
#endif

#define CACHELINE_ALIGNMENT 64
#define force_cacheline_align force_align_to(CACHELINE_ALIGNMENT)

Looks complete enough for now.
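
To illustrate the intended use (just a sketch, untested; the variable names are
the statically allocated snapshots from snapmgr.c, and MSVC would likely need
the qualifier in a different position):

/* hypothetical usage: put the static snapshots on their own cache lines */
static SnapshotData CurrentSnapshotData force_cacheline_align =
	{HeapTupleSatisfiesMVCC};
static SnapshotData SecondarySnapshotData force_cacheline_align =
	{HeapTupleSatisfiesMVCC};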

Back to PGPROC:

Looking at the current PGPROC layout on x86-64 with the Linux ABI (generated by
pahole):

struct PGPROC {
SHM_QUEUE links; /* 0 16 */
PGSemaphoreData sem; /* 16 8 */
int waitStatus; /* 24 4 */
Latch procLatch; /* 28 12 */
LocalTransactionId lxid; /* 40 4 */
int pid; /* 44 4 */
int pgprocno; /* 48 4 */
BackendId backendId; /* 52 4 */
Oid databaseId; /* 56 4 */
Oid roleId; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
bool recoveryConflictPending; /* 64 1 */
bool lwWaiting; /* 65 1 */
bool lwExclusive; /* 66 1 */

/* XXX 5 bytes hole, try to pack */

struct PGPROC * lwWaitLink; /* 72 8 */
LOCK * waitLock; /* 80 8 */
PROCLOCK * waitProcLock; /* 88 8 */
LOCKMODE waitLockMode; /* 96 4 */
LOCKMASK heldLocks; /* 100 4 */
XLogRecPtr waitLSN; /* 104 8 */
int syncRepState; /* 112 4 */

/* XXX 4 bytes hole, try to pack */

SHM_QUEUE syncRepLinks; /* 120 16 */
/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
SHM_QUEUE myProcLocks[16]; /* 136 256 */
/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
struct XidCache subxids; /* 392 256 */
/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
LWLockId backendLock; /* 648 4 */

/* XXX 4 bytes hole, try to pack */

uint64 fpLockBits; /* 656 8 */
Oid fpRelId[16]; /* 664 64 */
/* --- cacheline 11 boundary (704 bytes) was 24 bytes ago --- */
bool fpVXIDLock; /* 728 1 */

/* XXX 3 bytes hole, try to pack */

LocalTransactionId fpLocalTransactionId; /* 732 4 */

/* size: 736, cachelines: 12, members: 28 */
/* sum members: 720, holes: 4, sum holes: 16 */
/* last cacheline: 32 bytes */
}

It seems possible that reordering it to avoid sharing unrelated stuff on the
same cacheline might be beneficial. E.g. moving the fastpath lock stuff away
from the more commonly accessed lwlock infrastructure might be a good idea.
Currently those lines will bounce continually even though the fastpath stuff
will mostly be accessed locally.
Subxids and syncRepLinks look like good candidates too.
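
To make that a bit more concrete, here is a toy sketch (not PGPROC itself, and
untested) of using the force_align_to() macro from above to keep fields that
other backends write frequently off the cache line holding the mostly
backend-local fast-path state:

#include <stdint.h>

/* assumes the force_align_to()/CACHELINE_ALIGNMENT definitions sketched above */
typedef struct DemoProc
{
	/* contended: written by other backends, e.g. during lock hand-off */
	int			waitStatus;
	uint64_t	sharedCounter;	/* made-up stand-in for other shared state */

	/*
	 * Mostly backend-local: forcing these onto a new cache line keeps writes
	 * to the fields above from invalidating them on other CPUs.
	 */
	uint64_t	fpLockBits force_align_to(CACHELINE_ALIGNMENT);
	uint32_t	fpRelId[16];
} DemoProc;

For the real PGPROC the shared memory allocator would presumably also have to
hand back cacheline-aligned addresses for the alignment to actually take effect.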

Anyway, sorry for the topic drift.

Andres
