Re: PATCH: two slab-like memory allocators

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, John Gorman <johngorman2(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: two slab-like memory allocators
Date: 2017-02-11 13:40:18
Message-ID: 53def906-2197-bc06-b987-d237f9574800@2ndquadrant.com
Lists: pgsql-hackers

On 02/11/2017 02:33 AM, Andres Freund wrote:
> On 2017-02-11 02:13:59 +0100, Tomas Vondra wrote:
>> On 02/09/2017 10:37 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2016-12-13 01:45:13 +0100, Tomas Vondra wrote:
>>>> src/backend/utils/mmgr/Makefile | 2 +-
>>>> src/backend/utils/mmgr/aset.c | 115 +--------------------------------
>>>> src/backend/utils/mmgr/memdebug.c | 131 ++++++++++++++++++++++++++++++++++++++
>>>> src/include/utils/memdebug.h | 22 +++++++
>>>> 4 files changed, 156 insertions(+), 114 deletions(-)
>>>> create mode 100644 src/backend/utils/mmgr/memdebug.c
>>>
>>> I'm a bit loath to move these to a .c file - won't this likely make
>>> these debugging tools even slower? Seems better to put some of them
>>> in a header as static inlines (not randomize, but the rest).
>>>
>>
>> Do you have any numbers to support that? AFAICS compilers have gotten
>> really good at inlining stuff on their own.
>
> Unless you use LTO, they can't inline across translation units. And
> using LTO is slow enough for linking that it's not that much fun to use,
> as it makes compile-edit-compile cycles essentially take as long as a
> full rebuild.
>
>
>>>> From 43aaabf70b979b172fd659ef4d0ef129fd78d72d Mon Sep 17 00:00:00 2001
>>>> From: Tomas Vondra <tomas(at)2ndquadrant(dot)com>
>>>> Date: Wed, 30 Nov 2016 15:36:23 +0100
>>>> Subject: [PATCH 2/3] slab allocator
>>>>
>>>> ---
>>>> src/backend/replication/logical/reorderbuffer.c | 82 +--
>>>> src/backend/utils/mmgr/Makefile | 2 +-
>>>> src/backend/utils/mmgr/slab.c | 803 ++++++++++++++++++++++++
>>>> src/include/nodes/memnodes.h | 2 +-
>>>> src/include/nodes/nodes.h | 1 +
>>>> src/include/replication/reorderbuffer.h | 15 +-
>>>> src/include/utils/memutils.h | 9 +
>>>
>>> I'd like to see the reorderbuffer changes split into a separate commit
>>> from the slab allocator introduction.
>>>
>>
>> I rather dislike patches that only add a bunch of code, without actually
>> using it anywhere.
>
>> But if needed, this is trivial to do at commit time - just don't
>> commit the reorderbuffer bits.
>
> Meh.
>
>
>>>> + * Each block includes a simple bitmap tracking which chunks are used/free.
>>>> + * This makes it trivial to check if all chunks on the block are free, and
>>>> + * eventually free the whole block (which is almost impossible with a global
>>>> + * freelist of chunks, storing chunks from all blocks).
>>>
>>> Why is checking a potentially somewhat long-ish bitmap better than a
>>> simple counter, or a "linked list" of "next free chunk-number" or such
>>> (where free chunks simply contain the id of the subsequent chunk)?
>>> Using a list instead of a bitmap would also make it possible to get
>>> 'lifo' behaviour, which is good for cache efficiency. A simple
>>> chunk-number based singly linked list would only imply a minimum
>>> allocation size of 4 - that seems perfectly reasonable?
>>>
>>
>> A block-level counter would be enough to decide if all chunks on the block
>> are free, but it's not sufficient to identify which chunks are free /
>> available for reuse.
>>
>> The bitmap only has a single bit per chunk, so I find "potentially long-ish"
>> a bit misleading. Any linked list implementation will require much more
>> per-chunk overhead - as the chunks are fixed-length, it's possible to use
>> a chunk index (instead of 64-bit pointers) to save some space. But with large
>> blocks / small chunks that's still at least 2 or 4 bytes per index, and
>> you'll need two (to implement a doubly-linked list, to make add/remove
>> efficient).
>
>> For example assume 8kB block and 64B chunks, i.e. 128 chunks. With bitmap
>> that's 16B to track all free space on the block. Doubly linked list would
>> require 1B per chunk index, 2 indexes per chunk. That's 128*2 = 256B.
>
>> I have a hard time believing the cache efficiency of linked lists
>> (which may or may not be real in this case) outweighs this, but if you
>> want to try, be my guest.
>
> I'm not following - why would there be overhead in anything for
> allocations bigger than 4 (or maybe 8) bytes? You can store the list
> (via chunk ids, not pointers) inside the chunks itself, where data
> otherwise would be. And I don't see why you'd need a doubly linked
> list, as the only operations that are needed are to push to the front of
> the list, and to pop from the front of the list - and both operations
> are simple to do with a singly linked list?
>

Oh! I had not considered storing the chunk indexes (for linked lists)
in the chunks themselves, which obviously eliminates the overhead concerns,
and you're right that a singly-linked list should be enough.

There will be some minimum-chunk-size requirement, depending on the
block size/chunk size. I wonder whether it makes sense to try to be
smart and make it dynamic, so that we only require 1B or 2B for cases
when only that many chunks fit into a block, or just say that it's 4B
and be done with it.

I mean 2^32 chunks ought to be enough for anyone, right?

I'm still not buying the cache efficiency argument, though. One of the
reasons is that the implementation prefers blocks with fewer free chunks
when handling palloc(), so pfree() is making the block less likely to be
chosen by the next palloc().

>
>>> Thirdly, isn't that approach going to result in a quite long freelists
>>> array, when you have small items and a decent blocksize? That seems like
>>> a fairly reasonable thing to do?
>>>
>>
>> I'm confused. Why wouldn't that be reasonable? Or rather, what would be a
>> more reasonable way?
>
> If I understood correctly, you have an array of doubly linked lists.
> A block is stored in the list at the index #block's-free-elements. Is that
> right?
>
> If so, if you have e.g. 8 byte allocations and 64kb sized blocks, you
> end up with an array of 1024 doubly linked lists, which'll take up 64kb
> on its own. And there are certainly scenarios where even bigger block
> sizes could make sense. That's both memory overhead, and runtime
> overhead, because at reset-time we'll have to check the whole array
> (which'll presumably largely be empty lists). Resetting is a pretty
> common path...
>

True, but it's not entirely clear if resetting is common for the paths
where we use those new allocators.

Also, if we accept that it might be a problem, what other solution do you
propose? I don't think just merging everything into a single list is a
good idea, for the reasons I explained before (it might make the resets
somewhat less expensive, but it'll make pfree() more expensive).

What might work is replacing the array with a list, though. So we'd have
a list of lists, which would eliminate the array overhead.

>
>>>> + /*
>>>> + * We need to update index of the next free chunk on the block. If we used
>>>> + * the last free chunk on this block, set it to chunksPerBlock (which is
>>>> + * not a valid chunk index). Otherwise look for the next chunk - we know
>>>> + * that it has to be above the current firstFreeChunk value, thanks to how
>>>> + * we maintain firstFreeChunk here and in SlabFree().
>>>> + */
>>>> + if (block->nfree == 0)
>>>> + block->firstFreeChunk = set->chunksPerBlock;
>>>> + else
>>>> + {
>>>> + /* look for the next free chunk in the block, after the first one */
>>>> + while ((++block->firstFreeChunk) < set->chunksPerBlock)
>>>> + {
>>>> + int byte = block->firstFreeChunk / 8;
>>>> + int bit = block->firstFreeChunk % 8;
>>>> +
>>>> + /* stop when you find 0 (unused chunk) */
>>>> + if (!(block->bitmapptr[byte] & (0x01 << bit)))
>>>> + break;
>>>> + }
>>>> +
>>>> + /* must have found the free chunk */
>>>> + Assert(block->firstFreeChunk != set->chunksPerBlock);
>>>> + }
>>>
>>> This and previous code just re-affirms my opinion that a bitmap is not
>>> the best structure here.
>>>
>>
>> It'd be great if you could explain why, instead of just making such claims
>> ...
>
> Because it's complicated. This is a fair bit of code and branches to
> run in a pretty hot path.
>

Hmm. I admit updating the index of the first free chunk is a bit
cumbersome, and the linked list would make it unnecessary.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
