Re: POC: Cache data in GetSnapshotData()

From: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cache data in GetSnapshotData()
Date: 2017-08-29 07:09:42
Message-ID: CAD__OugfQeL-8irv2OavhGAYJvU7VQkqe-WtT5r_X4TKxBU-kQ@mail.gmail.com
Lists: pgsql-hackers

Hi all, please ignore this mail; it is incomplete and I still have to
add more information. Sorry, I accidentally pressed the send button
while replying.

On Tue, Aug 29, 2017 at 11:27 AM, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com> wrote:
> Cache the SnapshotData for reuse:
> ===========================
> In one of our performance analyses with the perf tool, GetSnapshotData
> showed up with a very high percentage of CPU cycles on a read-only
> workload with a high number of concurrent connections (>= 64).
>
> Machine : cthulhu 8 node machine.
> ----------------------------------------------
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Thread(s) per core: 2
> Core(s) per socket: 8
> Socket(s): 8
> NUMA node(s): 8
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 47
> Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
> Stepping: 2
> CPU MHz: 2128.000
> BogoMIPS: 4266.63
> Virtualization: VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 256K
> L3 cache: 24576K
> NUMA node0 CPU(s): 0,65-71,96-103
> NUMA node1 CPU(s): 72-79,104-111
> NUMA node2 CPU(s): 80-87,112-119
> NUMA node3 CPU(s): 88-95,120-127
> NUMA node4 CPU(s): 1-8,33-40
> NUMA node5 CPU(s): 9-16,41-48
> NUMA node6 CPU(s): 17-24,49-56
> NUMA node7 CPU(s): 25-32,57-64
>
> Perf CPU cycles, 128 clients.
>
> On further analysis, this appears to be mainly due to cache misses
> incurred while iterating through the large proc array to compute the
> SnapshotData. Also, on a read-only (read-heavy) workload the
> SnapshotData mostly remains the same, since no (or rarely a)
> transaction commits or rolls back. Taking advantage of this, we can
> save the previously computed SnapshotData in shared memory, and in
> GetSnapshotData we can copy the saved SnapshotData instead of
> recomputing it, as long as it is still valid. A saved SnapshotData
> remains valid until some transaction ends.
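>
> In outline, the reuse path inside GetSnapshotData looks like this (a
> simplified sketch only; the field names follow the patch quoted further
> below, and locking plus the recovery case are elided here):
>
> if (ProcGlobal->cached_snapshot_valid)
> {
>     /*
>      * Fast path: copy the shared cached snapshot into the caller's
>      * SnapshotData, keeping the caller's own xip/subxip arrays.
>      */
>     Snapshot    csnap = &ProcGlobal->cached_snapshot;
>     TransactionId *saved_xip = snapshot->xip;
>     TransactionId *saved_subxip = snapshot->subxip;
>
>     memcpy(snapshot, csnap, sizeof(SnapshotData));
>     snapshot->xip = saved_xip;
>     snapshot->subxip = saved_subxip;
>     memcpy(snapshot->xip, csnap->xip, sizeof(TransactionId) * csnap->xcnt);
>     memcpy(snapshot->subxip, csnap->subxip, sizeof(TransactionId) * csnap->subxcnt);
> }
> else
> {
>     /*
>      * Slow path: scan the proc array as before, then try to cache the
>      * freshly computed snapshot for the next caller.
>      */
> }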
>
> [Perf analysis Base]
>
> Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
> Overhead  Command   Shared Object  Symbol
>   18.63%  postgres  postgres       [.] GetSnapshotData
>   11.98%  postgres  postgres       [.] _bt_compare
>   10.20%  postgres  postgres       [.] PinBuffer
>    8.63%  postgres  postgres       [.] LWLockAttemptLock
>    6.50%  postgres  postgres       [.] LWLockRelease
>
>
> [Perf analysis with Patch]
>
>
> Server configuration:
> ./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
> wal_buffers=256MB &
>
> pgbench configuration:
> scale_factor = 300
> ./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S postgres
>
> The machine has 64 cores, and with this patch I can see the server
> start to improve beyond 64 clients. I have tested up to 256 clients,
> which shows a maximum performance improvement of nearly 39% [1]. This
> is the best case for the patch, where the once-computed SnapshotData
> is reused again and again.
>
> The worst case seems to be small, quick write transactions, which
> frequently invalidate the cached SnapshotData before it is reused by
> any subsequent GetSnapshotData call. So far I have tested a simple
> update workload (pgbench -M prepared -N) on the same machine with the
> above server configuration, and I do not see much change in TPS
> numbers.
>
> All TPS are the median of 3 runs:
>
> Clients  TPS-With-Patch-05  TPS-Base       %Diff
> 1        752.461117         755.186777     -0.3%
> 64       32171.296537       31202.153576   +3.1%
> 128      41059.660769       40061.929658   +2.49%
>
> I will do some profiling to find out why this case does not cost us
> any performance despite the caching overhead.
>
>
> []
>
> On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> Hi,
>>
>> I think this patch should have a "cover letter" explaining what it's
>> trying to achieve, how it does so and why it's safe/correct. I think
>> it'd also be good to try to show some of the worst cases of this patch
>> (i.e. where the caching only adds overhead, but never gets used), and
>> some of the best cases (where the cache gets used quite often, and
>> reduces contention significantly).
>>
>> I wonder if there's a way to avoid copying the snapshot into the cache
>> in situations where we're barely ever going to need it. But right now
>> the only way I can think of is to look at the length of the
>> ProcArrayLock wait queue and count the exclusive locks therein - which'd
>> be expensive and intrusive...
>>
>>
>> On 2017-08-02 15:56:21 +0530, Mithun Cy wrote:
>>> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
>>> index a7e8cf2..4d7debe 100644
>>> --- a/src/backend/storage/ipc/procarray.c
>>> +++ b/src/backend/storage/ipc/procarray.c
>>> @@ -366,11 +366,13 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
>>> (arrayP->numProcs - index - 1) * sizeof(int));
>>> arrayP->pgprocnos[arrayP->numProcs - 1] = -1; /* for debugging */
>>> arrayP->numProcs--;
>>> + ProcGlobal->is_snapshot_valid = false;
>>> LWLockRelease(ProcArrayLock);
>>> return;
>>> }
>>> }
>>>
>>> + ProcGlobal->is_snapshot_valid = false;
>>> /* Oops */
>>> LWLockRelease(ProcArrayLock);
>>
>> Why do we need to do anything here? And if we need to, why not in
>> ProcArrayAdd?
>
> In ProcArrayRemove, ShmemVariableCache->latestCompletedXid is set to
> latestXid, which changes xmax in GetSnapshotData, hence I invalidated
> the snapshot there. ProcArrayAdd does not seem to affect the
> SnapshotData. I have now modified the patch to set
> cached_snapshot_valid to false only just below the statement
> ShmemVariableCache->latestCompletedXid = latestXid:
> +if (TransactionIdIsValid(latestXid))
> +{
> + Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
> + /* Advance global latestCompletedXid while holding the lock */
> + if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
> + latestXid))
> + ShmemVariableCache->latestCompletedXid = latestXid;
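>
> Roughly, the placement now looks like this (a sketch based on the
> description above; the exact context in the attached patch may differ):
>
> + if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
> +                           latestXid))
> +     ShmemVariableCache->latestCompletedXid = latestXid;
> +
> + /* latestCompletedXid (hence xmax) has changed; invalidate the cache. */
> + ProcGlobal->cached_snapshot_valid = false;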
>
>
>
>>> @@ -1552,6 +1557,8 @@ GetSnapshotData(Snapshot snapshot)
>>> errmsg("out of memory")));
>>> }
>>>
>>> + snapshot->takenDuringRecovery = RecoveryInProgress();
>>> +
>>> /*
>>> * It is sufficient to get shared lock on ProcArrayLock, even if we are
>>> * going to set MyPgXact->xmin.
>>> @@ -1566,100 +1573,200 @@ GetSnapshotData(Snapshot snapshot)
>>> /* initialize xmin calculation with xmax */
>>> globalxmin = xmin = xmax;
>>>
>>> - snapshot->takenDuringRecovery = RecoveryInProgress();
>>> -
>>
>> This movement of code seems fairly unrelated?
>
> -- Yes, not related; it was part of an early POC patch, so I had
> retained it as-is. I have now reverted those changes and moved them
> back under the ProcArrayLock.
>
>>> if (!snapshot->takenDuringRecovery)
>>> {
>>
>> Do we really need to restrict this to !recovery snapshots? It'd be nicer
>> if we could generalize it - I don't immediately see anything !recovery
>> specific in the logic here.
>
> Agreed. I will add one more patch on top of this to include the
> standby (recovery state) case as well, to unify all the cases where we
> can cache the snapshot. Before that, let me see
>
>
>>> - int *pgprocnos = arrayP->pgprocnos;
>>> - int numProcs;
>>> + if (ProcGlobal->is_snapshot_valid)
>>> + {
>>> + Snapshot csnap = &ProcGlobal->cached_snapshot;
>>> + TransactionId *saved_xip;
>>> + TransactionId *saved_subxip;
>>>
>>> - /*
>>> - * Spin over procArray checking xid, xmin, and subxids. The goal is
>>> - * to gather all active xids, find the lowest xmin, and try to record
>>> - * subxids.
>>> - */
>>> - numProcs = arrayP->numProcs;
>>> - for (index = 0; index < numProcs; index++)
>>> + saved_xip = snapshot->xip;
>>> + saved_subxip = snapshot->subxip;
>>> +
>>> + memcpy(snapshot, csnap, sizeof(SnapshotData));
>>> +
>>> + snapshot->xip = saved_xip;
>>> + snapshot->subxip = saved_subxip;
>>> +
>>> + memcpy(snapshot->xip, csnap->xip,
>>> + sizeof(TransactionId) * csnap->xcnt);
>>> + memcpy(snapshot->subxip, csnap->subxip,
>>> + sizeof(TransactionId) * csnap->subxcnt);
>>> + globalxmin = ProcGlobal->cached_snapshot_globalxmin;
>>> + xmin = csnap->xmin;
>>> +
>>> + count = csnap->xcnt;
>>> + subcount = csnap->subxcnt;
>>> + suboverflowed = csnap->suboverflowed;
>>> +
>>> + Assert(TransactionIdIsValid(globalxmin));
>>> + Assert(TransactionIdIsValid(xmin));
>>> + }
>>
>> Can we move this to a separate function? In fact, could you check how
>> much effort it'd be to split cached, !recovery, recovery cases into
>> three static inline helper functions? This is starting to be hard to
>> read, the indentation added in this patch doesn't help... Consider doing
>> recovery, !recovery cases in a preliminary separate patch.
>
> In this patch, I have moved the code into 2 functions: one to cache
> the snapshot data and another to get the snapshot data from the cache.
> In the next patch, I will try to unify these with the recovery
> (standby) case. For the recovery case, we copy the xids from a simple
> xid array, KnownAssignedXids[]; I am not completely sure whether
> caching them brings similar performance benefits as above, so I will
> also try to get a perf report for the same.
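>
> As a rough sketch, the split looks like this (the function names here
> are only illustrative; the actual names in the attached patch may
> differ):
>
> /*
>  * Store the just-computed snapshot in shared memory so that later
>  * GetSnapshotData() calls can reuse it while it stays valid.
>  */
> static void CacheSnapshotData(Snapshot snapshot, TransactionId globalxmin);
>
> /*
>  * Copy the cached snapshot into the caller's SnapshotData, preserving
>  * the caller's own xip/subxip arrays.
>  */
> static void GetSnapshotDataFromCache(Snapshot snapshot,
>                                      TransactionId *globalxmin,
>                                      TransactionId *xmin);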
>
>>
>>> + * Let only one of the caller cache the computed snapshot, and
>>> + * others can continue as before.
>>> */
>>> - if (!suboverflowed)
>>> + if (!ProcGlobal->is_snapshot_valid &&
>>> + LWLockConditionalAcquire(&ProcGlobal->CachedSnapshotLock,
>>> + LW_EXCLUSIVE))
>>> {
>>
>> I'd also move this to a helper function (the bit inside the lock).
>
> Fixed as suggested.
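>
> Roughly, the call site now looks like this (a sketch; the helper name
> is illustrative, and it also marks the cache valid before the lock is
> released, as in the hunk below):
>
> if (!ProcGlobal->cached_snapshot_valid &&
>     LWLockConditionalAcquire(&ProcGlobal->CachedSnapshotLock, LW_EXCLUSIVE))
> {
>     /*
>      * Only the backend that got the lock refreshes the shared cache;
>      * everyone else proceeds without waiting.
>      */
>     CacheSnapshotData(snapshot, globalxmin);
>     LWLockRelease(&ProcGlobal->CachedSnapshotLock);
> }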
>
>>> +
>>> + /*
>>> + * The memory barrier has be to be placed here to ensure that
>>> + * is_snapshot_valid is set only after snapshot is cached.
>>> + */
>>> + pg_write_barrier();
>>> + ProcGlobal->is_snapshot_valid = true;
>>> + LWLockRelease(&ProcGlobal->CachedSnapshotLock);
>>
>> LWLockRelease() is a full memory barrier, so this shouldn't be required.
>
> Okay, sorry for my misunderstanding. Do you want me to set
> ProcGlobal->is_snapshot_valid after
> LWLockRelease(&ProcGlobal->CachedSnapshotLock)?
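>
> Just to confirm my reading, is the suggestion one of the following two,
> i.e. simply dropping the explicit pg_write_barrier()?
>
> /* (a) set the flag before releasing the lock, no explicit barrier */
> ProcGlobal->cached_snapshot_valid = true;
> LWLockRelease(&ProcGlobal->CachedSnapshotLock);
>
> /* (b) or set the flag only after the lock release */
> LWLockRelease(&ProcGlobal->CachedSnapshotLock);
> ProcGlobal->cached_snapshot_valid = true;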
>
>>
>>> /*
>>> diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
>>> index 7dbaa81..7d06904 100644
>>> --- a/src/include/storage/proc.h
>>> +++ b/src/include/storage/proc.h
>>> @@ -16,6 +16,7 @@
>>>
>>> #include "access/xlogdefs.h"
>>> #include "lib/ilist.h"
>>> +#include "utils/snapshot.h"
>>> #include "storage/latch.h"
>>> #include "storage/lock.h"
>>> #include "storage/pg_sema.h"
>>> @@ -253,6 +254,22 @@ typedef struct PROC_HDR
>>> int startupProcPid;
>>> /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
>>> int startupBufferPinWaitBufId;
>>> +
>>> + /*
>>> + * In GetSnapshotData we can reuse the previously snapshot computed if no
>>> + * new transaction has committed or rolledback. Thus saving good amount of
>>> + * computation cycle under GetSnapshotData where we need to iterate
>>> + * through procArray every time we try to get the current snapshot. Below
>>> + * members help us to save the previously computed snapshot in global
>>> + * shared memory and any process which want to get current snapshot can
>>> + * directly copy from them if it is still valid.
>>> + */
>>> + SnapshotData cached_snapshot; /* Previously saved snapshot */
>>> + volatile bool is_snapshot_valid; /* is above snapshot valid */
>>> + TransactionId cached_snapshot_globalxmin; /* globalxmin when above
>>
>>
>> I'd rename is_snapshot_valid to 'cached_snapshot_valid' or such, it's
>> not clear from the name what it refers to.
>
> -- Renamed to cached_snapshot_valid.
>
>>
>>> + LWLock CachedSnapshotLock; /* A try lock to make sure only one
>>> + * process write to cached_snapshot */
>>> } PROC_HDR;
>>>
>>
>> Hm, I'd not add the lock to the struct, we normally don't do that for
>> the "in core" lwlocks that don't exist in a "configurable" number like
>> the buffer locks and such.
>>
>
> -- I am not really sure about this. Could you please help me
> understand what exactly I should do here to correct it? Do I have to
> add it to lwlocknames.txt as a new LWLock?
>
> --
> Thanks and Regards
> Mithun C Y
> EnterpriseDB: http://www.enterprisedb.com

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
