From: Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: 9.5: Better memory accounting, towards memory-bounded HashAgg
On 17.11.2014 19:46, Andres Freund wrote:
> On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote:
>> On 17.11.2014 18:04, Andres Freund wrote:
>>> On 2014-11-16 23:31:51 -0800, Jeff Davis wrote:
>>>> *** a/src/include/nodes/memnodes.h
>>>> --- b/src/include/nodes/memnodes.h
>>>> *** 60,65 **** typedef struct MemoryContextData
>>>> --- 60,66 ----
>>>> MemoryContext nextchild; /* next child of same parent */
>>>> char *name; /* context name (just for debugging) */
>>>> bool isReset; /* T = no space alloced since last reset */
>>>> + uint64 mem_allocated; /* track memory allocated for this context */
>>>> #ifdef USE_ASSERT_CHECKING
>>>> bool allowInCritSection; /* allow palloc in critical section */
>>> That's quite possibly one culprit for the slowdown. Right now one
>>> AllocSetContext struct fits precisely into three cachelines. After
>>> your change it doesn't anymore.
>> I'm no PPC64 expert, but I thought the cache lines are 128 B on that
>> platform, since at least Power6?
> Hm, might be true.
>> Also, if I'm counting right, the MemoryContextData structure is 56B
>> without the 'mem_allocated' field (and without the allowInCritSection),
>> and 64B with it (at that particular place). sizeof() seems to confirm
>> that. (But I'm on x86, so maybe the alignment on PPC64 is different?).
> The MemoryContextData struct is embedded into AllocSetContext.
Oh, right. That makes it slightly more complicated, though, because
AllocSetContext adds 6 x 8B fields plus an in-line array of freelist
pointers, which is 11x8 = 88 bytes. So in total 56+48+88 = 192B, without
the additional field. There might be some difference because of alignment,
but I still don't see how that one additional field might impact cachelines?
But if we separated the freelist, that might actually make it faster, at
least for calls that don't touch the freelist at all, no? After all, most
palloc() calls are handled from the current block.
I tried this on the patch I sent on 23/8 (with the MemoryAccounting
hierarchy parallel to the contexts), and I got this (running 10x the
reindex, without restarts or such):
without the patch
... 1723933 KB used: CPU 1.35s/4.33u sec elapsed 10.38 sec
... 1723933 KB used: CPU 1.39s/4.19u sec elapsed 9.89 sec
... 1723933 KB used: CPU 1.39s/4.26u sec elapsed 9.90 sec
... 1723933 KB used: CPU 1.45s/4.22u sec elapsed 10.15 sec
... 1723933 KB used: CPU 1.36s/4.20u sec elapsed 9.84 sec
... 1723933 KB used: CPU 1.32s/4.20u sec elapsed 9.96 sec
... 1723933 KB used: CPU 1.50s/4.23u sec elapsed 9.91 sec
... 1723933 KB used: CPU 1.47s/4.24u sec elapsed 9.90 sec
... 1723933 KB used: CPU 1.38s/4.23u sec elapsed 10.09 sec
... 1723933 KB used: CPU 1.37s/4.19u sec elapsed 9.84 sec
with the patch (no tracking enabled)
... 1723933 KB used: CPU 1.40s/4.28u sec elapsed 10.79 sec
... 1723933 KB used: CPU 1.39s/4.32u sec elapsed 10.14 sec
... 1723933 KB used: CPU 1.40s/4.16u sec elapsed 9.99 sec
... 1723933 KB used: CPU 1.32s/4.21u sec elapsed 10.16 sec
... 1723933 KB used: CPU 1.37s/4.34u sec elapsed 10.03 sec
... 1723933 KB used: CPU 1.35s/4.22u sec elapsed 9.97 sec
... 1723933 KB used: CPU 1.43s/4.20u sec elapsed 9.99 sec
... 1723933 KB used: CPU 1.39s/4.17u sec elapsed 9.71 sec
... 1723933 KB used: CPU 1.44s/4.16u sec elapsed 9.91 sec
... 1723933 KB used: CPU 1.40s/4.25u sec elapsed 9.99 sec
So it's about 9.986 vs 10.068 for the averages, and 9.905 vs 9.99 for the
medians, i.e. a ~0.8% difference. That's on x86, though. I wonder what the
effect would be on the PPC machine.
Patch attached - it's not perfect, though. For example it should free
the freelists at some point, but I feel a bit lazy today.