Re: POC: Cache data in GetSnapshotData()

From: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cache data in GetSnapshotData()
Date: 2017-08-29 05:57:57
Message-ID: CAD__OujRqABhiAJLcPVBJCvDWBWVotb7kxm+ECppTa+nyfETXA@mail.gmail.com
Lists: pgsql-hackers

Cache the SnapshotData for reuse:
===========================
In one of our performance analyses using the perf tool, GetSnapshotData showed up with a very high percentage of CPU cycles on a read-only workload with a very high number of concurrent connections (>= 64).

Machine: cthulhu, an 8-node machine.
----------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 2128.000
BogoMIPS: 4266.63
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 24576K
NUMA node0 CPU(s): 0,65-71,96-103
NUMA node1 CPU(s): 72-79,104-111
NUMA node2 CPU(s): 80-87,112-119
NUMA node3 CPU(s): 88-95,120-127
NUMA node4 CPU(s): 1-8,33-40
NUMA node5 CPU(s): 9-16,41-48
NUMA node6 CPU(s): 17-24,49-56
NUMA node7 CPU(s): 25-32,57-64

[Perf CPU cycles, 128 clients]

On further analysis, this appears to be mainly due to cache misses incurred while iterating through the large proc array to compute the SnapshotData. Also, on a read-only (read-heavy) workload the SnapshotData mostly remains the same, since transactions rarely commit or roll back. Taking advantage of this, we can save the previously computed SnapshotData in shared memory, and in GetSnapshotData we can copy the saved SnapshotData instead of recomputing it, as long as it is still valid. A saved SnapshotData remains valid until a transaction ends.
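
To summarize the approach, the control flow is roughly as below (a simplified sketch only; the helper names used here are illustrative, the real code is in the attached patch):

    /* On every transaction end (e.g. ProcArrayEndTransaction / ProcArrayRemove): */
    ProcGlobal->cached_snapshot_valid = false;

    /* In GetSnapshotData(), while holding ProcArrayLock in shared mode: */
    if (ProcGlobal->cached_snapshot_valid)
    {
        /* Fast path: copy the saved snapshot instead of scanning the proc array. */
        CopySnapshotFromCache(snapshot, &ProcGlobal->cached_snapshot);
    }
    else
    {
        /* Slow path: compute the snapshot from the proc array as before. */
        ComputeSnapshotFromProcArray(snapshot);

        /* Let only one backend publish its result for others to reuse. */
        if (LWLockConditionalAcquire(&ProcGlobal->CachedSnapshotLock, LW_EXCLUSIVE))
        {
            SaveSnapshotToCache(&ProcGlobal->cached_snapshot, snapshot);
            ProcGlobal->cached_snapshot_valid = true;
            LWLockRelease(&ProcGlobal->CachedSnapshotLock);
        }
    }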

[Perf analysis Base]

Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
Overhead  Command   Shared Object  Symbol
 18.63%   postgres  postgres       [.] GetSnapshotData
 11.98%   postgres  postgres       [.] _bt_compare
 10.20%   postgres  postgres       [.] PinBuffer
  8.63%   postgres  postgres       [.] LWLockAttemptLock
  6.50%   postgres  postgres       [.] LWLockRelease

[Perf analysis with Patch]

Server configuration:
./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
wal_buffers=256MB &

pgbench configuration:
scale_factor = 300
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S postgres

The machine has 64 cores, and with this patch I can see the server start improving after 64 clients. I have tested up to 256 clients, which shows a maximum performance improvement of nearly 39% [1]. This is the best case for the patch, where the once-computed SnapshotData is reused over and over.

The worst case seems to be small, quick write transactions, which frequently invalidate the cached SnapshotData before it can be reused by any subsequent GetSnapshotData call. So far I have tested simple-update runs (pgbench -M prepared -N) on the same machine with the above server configuration, and I do not see much change in TPS numbers.

All TPS numbers are the median of 3 runs.

Clients   TPS with Patch-05   TPS Base        %Diff
1         752.461117          755.186777      -0.3%
64        32171.296537        31202.153576    +3.1%
128       41059.660769        40061.929658    +2.49%

I will do some profiling to find out why the caching overhead is not costing us any performance in this case.


On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> I think this patch should have a "cover letter" explaining what it's
> trying to achieve, how it does so and why it's safe/correct. I think
> it'd also be good to try to show some of the worst cases of this patch
> (i.e. where the caching only adds overhead, but never gets used), and
> some of the best cases (where the cache gets used quite often, and
> reduces contention significantly).
>
> I wonder if there's a way to avoid copying the snapshot into the cache
> in situations where we're barely ever going to need it. But right now
> the only way I can think of is to look at the length of the
> ProcArrayLock wait queue and count the exclusive locks therein - which'd
> be expensive and intrusive...
>
>
> On 2017-08-02 15:56:21 +0530, Mithun Cy wrote:
>> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
>> index a7e8cf2..4d7debe 100644
>> --- a/src/backend/storage/ipc/procarray.c
>> +++ b/src/backend/storage/ipc/procarray.c
>> @@ -366,11 +366,13 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
>> (arrayP->numProcs - index - 1) * sizeof(int));
>> arrayP->pgprocnos[arrayP->numProcs - 1] = -1; /* for debugging */
>> arrayP->numProcs--;
>> + ProcGlobal->is_snapshot_valid = false;
>> LWLockRelease(ProcArrayLock);
>> return;
>> }
>> }
>>
>> + ProcGlobal->is_snapshot_valid = false;
>> /* Oops */
>> LWLockRelease(ProcArrayLock);
>
> Why do we need to do anything here? And if we need to, why not in
> ProcArrayAdd?

In ProcArrayRemove I can see that ShmemVariableCache->latestCompletedXid is set to latestXid, which will change xmax in GetSnapshotData, hence the snapshot is invalidated there. ProcArrayAdd does not seem to affect the SnapshotData. I have now modified the patch to set cached_snapshot_valid to false just below the statement ShmemVariableCache->latestCompletedXid = latestXid only:
+if (TransactionIdIsValid(latestXid))
+{
+    Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
+    /* Advance global latestCompletedXid while holding the lock */
+    if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
+                              latestXid))
+        ShmemVariableCache->latestCompletedXid = latestXid;
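
i.e. the hunk continues with something like (sketched here for clarity; the exact change is in the attached patch):

+
+    /* latestCompletedXid changed, so the cached snapshot is now stale */
+    ProcGlobal->cached_snapshot_valid = false;
+}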

>> @@ -1552,6 +1557,8 @@ GetSnapshotData(Snapshot snapshot)
>> errmsg("out of memory")));
>> }
>>
>> + snapshot->takenDuringRecovery = RecoveryInProgress();
>> +
>> /*
>> * It is sufficient to get shared lock on ProcArrayLock, even if we are
>> * going to set MyPgXact->xmin.
>> @@ -1566,100 +1573,200 @@ GetSnapshotData(Snapshot snapshot)
>> /* initialize xmin calculation with xmax */
>> globalxmin = xmin = xmax;
>>
>> - snapshot->takenDuringRecovery = RecoveryInProgress();
>> -
>
> This movement of code seems fairly unrelated?

-- Yes, it is not related; it was part of an early POC patch, so I had retained it as is. I have now reverted those changes and moved them back under the ProcArrayLock.

>> if (!snapshot->takenDuringRecovery)
>> {
>
> Do we really need to restrict this to !recovery snapshots? It'd be nicer
> if we could generalize it - I don't immediately see anything !recovery
> specific in the logic here.

Agreed. I will add one more patch on top of this to include the standby (recovery) case as well, to unify all the cases where we can cache the snapshot. But first let me check whether caching brings similar benefits there.

>> - int *pgprocnos = arrayP->pgprocnos;
>> - int numProcs;
>> + if (ProcGlobal->is_snapshot_valid)
>> + {
>> + Snapshot csnap = &ProcGlobal->cached_snapshot;
>> + TransactionId *saved_xip;
>> + TransactionId *saved_subxip;
>>
>> - /*
>> - * Spin over procArray checking xid, xmin, and subxids. The goal is
>> - * to gather all active xids, find the lowest xmin, and try to record
>> - * subxids.
>> - */
>> - numProcs = arrayP->numProcs;
>> - for (index = 0; index < numProcs; index++)
>> + saved_xip = snapshot->xip;
>> + saved_subxip = snapshot->subxip;
>> +
>> + memcpy(snapshot, csnap, sizeof(SnapshotData));
>> +
>> + snapshot->xip = saved_xip;
>> + snapshot->subxip = saved_subxip;
>> +
>> + memcpy(snapshot->xip, csnap->xip,
>> + sizeof(TransactionId) * csnap->xcnt);
>> + memcpy(snapshot->subxip, csnap->subxip,
>> + sizeof(TransactionId) * csnap->subxcnt);
>> + globalxmin = ProcGlobal->cached_snapshot_globalxmin;
>> + xmin = csnap->xmin;
>> +
>> + count = csnap->xcnt;
>> + subcount = csnap->subxcnt;
>> + suboverflowed = csnap->suboverflowed;
>> +
>> + Assert(TransactionIdIsValid(globalxmin));
>> + Assert(TransactionIdIsValid(xmin));
>> + }
>
> Can we move this to a separate function? In fact, could you check how
> much effort it'd be to split cached, !recovery, recovery cases into
> three static inline helper functions? This is starting to be hard to
> read, the indentation added in this patch doesn't help... Consider doing
> recovery, !recovery cases in a preliminary separate patch.

In this patch, I have moved the code into 2 functions: one to cache the snapshot data and another to get the snapshot data from the cache. In the next patch, I will try to unify these with the recovery (standby) case. For the recovery case we copy the xids from a simple xid array, KnownAssignedXids[]; I am not completely sure whether caching them brings similar performance benefits as above, so I will also try to get a perf report for that.
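
Roughly, the split looks like this (the names below are only placeholders I am using for illustration; the actual function names are in the attached patch):

    /*
     * Fill *snapshot from ProcGlobal->cached_snapshot.  Caller must hold
     * ProcArrayLock and have already checked ProcGlobal->cached_snapshot_valid.
     */
    static void GetSnapshotDataFromCache(Snapshot snapshot);

    /*
     * Publish a freshly computed *snapshot into ProcGlobal->cached_snapshot,
     * but only if CachedSnapshotLock can be acquired conditionally, so that at
     * most one backend writes the cache while the others continue as before.
     */
    static void CacheSnapshotData(Snapshot snapshot, TransactionId globalxmin);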

>
>> + * Let only one of the caller cache the computed snapshot, and
>> + * others can continue as before.
>> */
>> - if (!suboverflowed)
>> + if (!ProcGlobal->is_snapshot_valid &&
>> + LWLockConditionalAcquire(&ProcGlobal->CachedSnapshotLock,
>> + LW_EXCLUSIVE))
>> {
>
> I'd also move this to a helper function (the bit inside the lock).

Fixed as suggested.

>> +
>> + /*
>> + * The memory barrier has be to be placed here to ensure that
>> + * is_snapshot_valid is set only after snapshot is cached.
>> + */
>> + pg_write_barrier();
>> + ProcGlobal->is_snapshot_valid = true;
>> + LWLockRelease(&ProcGlobal->CachedSnapshotLock);
>
> LWLockRelease() is a full memory barrier, so this shouldn't be required.

Okay, sorry for my misunderstanding. Do you want me to set
ProcGlobal->is_snapshot_valid after
LWLockRelease(&ProcGlobal->CachedSnapshotLock)?
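
That is, something like this (only to make sure I understand correctly)?

    /* ... code that fills ProcGlobal->cached_snapshot ... */
    LWLockRelease(&ProcGlobal->CachedSnapshotLock);
    /* relying on the lock release as the barrier instead of pg_write_barrier() */
    ProcGlobal->is_snapshot_valid = true;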

>
>> /*
>> diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
>> index 7dbaa81..7d06904 100644
>> --- a/src/include/storage/proc.h
>> +++ b/src/include/storage/proc.h
>> @@ -16,6 +16,7 @@
>>
>> #include "access/xlogdefs.h"
>> #include "lib/ilist.h"
>> +#include "utils/snapshot.h"
>> #include "storage/latch.h"
>> #include "storage/lock.h"
>> #include "storage/pg_sema.h"
>> @@ -253,6 +254,22 @@ typedef struct PROC_HDR
>> int startupProcPid;
>> /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
>> int startupBufferPinWaitBufId;
>> +
>> + /*
>> + * In GetSnapshotData we can reuse the previously snapshot computed if no
>> + * new transaction has committed or rolledback. Thus saving good amount of
>> + * computation cycle under GetSnapshotData where we need to iterate
>> + * through procArray every time we try to get the current snapshot. Below
>> + * members help us to save the previously computed snapshot in global
>> + * shared memory and any process which want to get current snapshot can
>> + * directly copy from them if it is still valid.
>> + */
>> + SnapshotData cached_snapshot; /* Previously saved snapshot */
>> + volatile bool is_snapshot_valid; /* is above snapshot valid */
>> + TransactionId cached_snapshot_globalxmin; /* globalxmin when above
>
>
> I'd rename is_snapshot_valid to 'cached_snapshot_valid' or such, it's
> not clear from the name what it refers to.

-- Renamed to cached_snapshot_valid.

>
>> + LWLock CachedSnapshotLock; /* A try lock to make sure only one
>> + * process write to cached_snapshot */
>> } PROC_HDR;
>>
>
> Hm, I'd not add the lock to the struct, we normally don't do that for
> the "in core" lwlocks that don't exist in a "configurable" number like
> the buffer locks and such.
>

-- I am not really sure about this. Can you please explain what exactly I should do here to get this corrected? Do I have to add it to lwlocknames.txt as a new LWLock?

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
