Re: PATCH: two slab-like memory allocators

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, John Gorman <johngorman2(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: two slab-like memory allocators
Date: 2017-02-11 01:13:59
Message-ID: b72dfa3d-c241-8a3e-97ad-bf214b754aca@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/09/2017 10:37 PM, Andres Freund wrote:
> Hi,
>
> On 2016-12-13 01:45:13 +0100, Tomas Vondra wrote:
>> src/backend/utils/mmgr/Makefile | 2 +-
>> src/backend/utils/mmgr/aset.c | 115 +--------------------------------
>> src/backend/utils/mmgr/memdebug.c | 131 ++++++++++++++++++++++++++++++++++++++
>> src/include/utils/memdebug.h | 22 +++++++
>> 4 files changed, 156 insertions(+), 114 deletions(-)
>> create mode 100644 src/backend/utils/mmgr/memdebug.c
>
> I'm a bit loathe to move these to a .c file - won't this likely make
> these debugging tools even slower? Seems better to put some of them
> them in a header as static inlines (not randomize, but the rest).
>

Do you have any numbers to support that? AFAICS compilers got really
good in inlining stuff on their own.

>
>> From 43aaabf70b979b172fd659ef4d0ef129fd78d72d Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas(at)2ndquadrant(dot)com>
>> Date: Wed, 30 Nov 2016 15:36:23 +0100
>> Subject: [PATCH 2/3] slab allocator
>>
>> ---
>> src/backend/replication/logical/reorderbuffer.c | 82 +--
>> src/backend/utils/mmgr/Makefile | 2 +-
>> src/backend/utils/mmgr/slab.c | 803 ++++++++++++++++++++++++
>> src/include/nodes/memnodes.h | 2 +-
>> src/include/nodes/nodes.h | 1 +
>> src/include/replication/reorderbuffer.h | 15 +-
>> src/include/utils/memutils.h | 9 +
>
> I'd like to see the reorderbuffer changes split into a separate commit
> from the slab allocator introduction.
>

I rather dislike patches that only add a bunch of code, without actually
using it anywhere. But if needed, this is trivial to do at commit time -
just don't commit the reorderbuffer bits.

>
>
>
>> +/*-------------------------------------------------------------------------
>> + *
>> + * slab.c
>> + * SLAB allocator definitions.
>> + *
>> + * SLAB is a custom MemoryContext implementation designed for cases of
>> + * equally-sized objects.
>> + *
>> + *
>> + * Portions Copyright (c) 2016, PostgreSQL Global Development Group
>
> Bump, before a committer forgets it.
>

OK.

>
>> + * IDENTIFICATION
>> + * src/backend/utils/mmgr/slab.c
>> + *
>> + *
>> + * The constant allocation size allows significant simplification and various
>> + * optimizations. Firstly, we can get rid of the doubling and carve the blocks
>> + * into chunks of exactly the right size (plus alignment), not wasting memory.
>
> Getting rid of it relative to what? I'd try to phrase it so these
> comments stand on their own.
>

OK, rill reword.

>
>> + * Each block includes a simple bitmap tracking which chunks are used/free.
>> + * This makes it trivial to check if all chunks on the block are free, and
>> + * eventually free the whole block (which is almost impossible with a global
>> + * freelist of chunks, storing chunks from all blocks).
>
> Why is checking a potentially somewhat long-ish bitmap better than a
> simple counter, or a "linked list" of "next free chunk-number" or such
> (where free chunks simply contain the id of the subsequent chunk)?
> Using a list instead of a bitmap would also make it possible to get
> 'lifo' behaviour, which is good for cache efficiency. A simple
> chunk-number based singly linked list would only imply a minimum
> allocation size of 4 - that seems perfectly reasonable?
>

A block-level counter would be enough to decide if all chunks on the
block are free, but it's not sufficient to identify which chunks are
free / available for reuse.

The bitmap only has a single bit per chunk, so I find "potentially
long-ish" is a bit misleading. Any linked list implementation will
require much more per-chunk overhead - as the chunks are fixed-legth,
it's possible to use chunk index (instead of 64-bit pointers), to save
some space. But with large blocks / small chunks that's still at least 2
or 4 bytes per index, and you'll need two (to implement doubly-linked
list, to make add/remove efficient).

For example assume 8kB block and 64B chunks, i.e. 128 chunks. With
bitmap that's 16B to track all free space on the block. Doubly linked
list would require 1B per chunk index, 2 indexes per chunk. That's 128*2
= 256B.

I have a hard time believing this the cache efficiency of linked lists
(which may or may not be real in this case) out-weights this, but if you
want to try, be my guest.

>
>> + * At the context level, we use 'freelist' to track blocks ordered by number
>> + * of free chunks, starting with blocks having a single allocated chunk, and
>> + * with completely full blocks on the tail.
>
> Why that way round? Filling chunks up as much as possible is good for
> cache and TLB efficiency, and allows for earlier de-allocation of
> partially used blocks? Oh, I see you do that in the next comment,
> but it still leaves me wondering.
>
> Also, is this actually a list? It's more an array of lists, right?
> I.e. it should be named freelists?
>

Possibly. Naming things is hard.

>
> Thirdly, isn't that approach going to result in a quite long freelists
> array, when you have small items and a decent blocksize? That seems like
> a fairly reasonable thing to do?
>

I'm confused. Why wouldn't that be reasonable. Or rather, what would be
a more reasonable way?

>
>> + * This also allows various optimizations - for example when searching for
>> + * free chunk, we the allocator reuses space from the most full blocks first,
>> + * in the hope that some of the less full blocks will get completely empty
>> + * (and returned back to the OS).
>
> Might be worth mentioning tlb/cache efficiency too.
>

I haven't really considered tlb/cache very much. The main goal of this
design was to free blocks (instead of keeping many partially-used blocks
around). If you have comments on this, feel free to add them.

>
>> + * For each block, we maintain pointer to the first free chunk - this is quite
>> + * cheap and allows us to skip all the preceding used chunks, eliminating
>> + * a significant number of lookups in many common usage patters. In the worst
>> + * case this performs as if the pointer was not maintained.
>
> Hm, so that'd be eliminated if we maintained a linked list of chunks (by
> chunk number) and a free_chunk_cnt or such.
>

As I explained above, I don't think linked list is a good solution. IIRC
correctly I've initially done that, and ended using the bitmap. If you
have idea how to do that, feel free to implement that and then we can do
some measurements and compare the patches.

>
>> +
>> +#include "postgres.h"
>> +
>> +#include "utils/memdebug.h"
>> +#include "utils/memutils.h"
>> +#include "lib/ilist.h"
>
> Move ilist up, above memdebug, so the list is alphabetically ordered.
>

OK

>
>> +/*
>> + * SlabPointer
>> + * Aligned pointer which may be a member of an allocation set.
>> + */
>> +typedef void *SlabPointer;
>> +typedef SlabContext *Slab;
>
> I personally wont commit this whith pointer hiding typedefs. If
> somebody else does, I can live with it, but for me it's bad enough taste
> that I wont.
>

Meh.

>
>> +/*
>> + * SlabContext is a specialized implementation of MemoryContext.
>> + */
>> +typedef struct SlabContext
>> +{
>> + MemoryContextData header; /* Standard memory-context fields */
>> + /* Allocation parameters for this context: */
>> + Size chunkSize; /* chunk size */
>> + Size fullChunkSize; /* chunk size including header and alignment */
>> + Size blockSize; /* block size */
>> + int chunksPerBlock; /* number of chunks per block */
>> + int minFreeChunks; /* min number of free chunks in any block */
>> + int nblocks; /* number of blocks allocated */
>> + /* blocks with free space, grouped by number of free chunks: */
>> + dlist_head freelist[FLEXIBLE_ARRAY_MEMBER];
>> +} SlabContext;
>> +
>
> Why aren't these ints something unsigned?
>

Yeah, some of those could be unsigned. Will check.

>
>> +/*
>> + * SlabIsValid
>> + * True iff set is valid allocation set.
>> + */
>> +#define SlabIsValid(set) PointerIsValid(set)
>
> It's not your fault, but this "iff" is obviously a lot stronger than the
> actual test ;). I seriously doubt this macro is worth anything...
>

Yeah.

>
>> +/*
>> + * SlabReset
>> + * Frees all memory which is allocated in the given set.
>> + *
>> + * The code simply frees all the blocks in the context - we don't keep any
>> + * keeper blocks or anything like that.
>> + */
>
> Why don't we? Seems quite worthwhile? Thinking about this, won't this
> result in a drastic increase of system malloc/mmap/brk traffic when
> there's lot of short transactions in reorderbuffer?
>

I haven't seen any significant impact of that during the tests I've done
(even with many tiny transactions), but it adding keeper blocks should
be trivial.

>
>> +static void
>> +SlabReset(MemoryContext context)
>> +{
>> + /* walk over freelists and free the blocks */
>> + for (i = 0; i <= set->chunksPerBlock; i++)
>> + {
>> + dlist_mutable_iter miter;
>> +
>> + dlist_foreach_modify(miter, &set->freelist[i])
>> + {
>> + SlabBlock block = dlist_container(SlabBlockData, node, miter.cur);
>> +
>> + dlist_delete(miter.cur);
>> +
>> + /* Normal case, release the block */
>
> What does "normal case" refer to here? Given that there's no alternative
> case...
>

Mem, bogus comment.

>
>> + /*
>> + * We need to update index of the next free chunk on the block. If we used
>> + * the last free chunk on this block, set it to chunksPerBlock (which is
>> + * not a valid chunk index). Otherwise look for the next chunk - we know
>> + * that it has to be above the current firstFreeChunk value, thanks to how
>> + * we maintain firstFreeChunk here and in SlabFree().
>> + */
>> + if (block->nfree == 0)
>> + block->firstFreeChunk = set->chunksPerBlock;
>> + else
>> + {
>> + /* look for the next free chunk in the block, after the first one */
>> + while ((++block->firstFreeChunk) < set->chunksPerBlock)
>> + {
>> + int byte = block->firstFreeChunk / 8;
>> + int bit = block->firstFreeChunk % 8;
>> +
>> + /* stop when you find 0 (unused chunk) */
>> + if (!(block->bitmapptr[byte] & (0x01 << bit)))
>> + break;
>> + }
>> +
>> + /* must have found the free chunk */
>> + Assert(block->firstFreeChunk != set->chunksPerBlock);
>> + }
>
> This and previous code just re-affirms my opinion that a bitmap is not
> the best structure here.
>

It'd be great if you could explain why, instead of just making such
claims ...

>
>> + /* move the whole block to the right place in the freelist */
>> + dlist_delete(&block->node);
>> + dlist_push_head(&set->freelist[block->nfree], &block->node);
>
> Hm. What if we, instead of the array of doubly linked lists, just kept
> a single linked list of blocks, and keep that list sorted by number of
> free chunks? Given that freeing / allocation never changes the number
> of allocated chunks by more than 1, we'll never have to move an entry
> far in that list to keep it sorted.
>

Only assuming that there'll be only few blocks with the same number of
free chunks. If that's not the case, you'll have to walk many blocks to
move the block to the right place in the list. The array of lists
handles such cases way more efficiently, and I think we should keep it.

>
>> +/*
>> + * SlabRealloc
>> + * As Slab is designed for allocating equally-sized chunks of memory, it
>> + * can't really do an actual realloc.
>> + *
>> + * We try to be gentle and allow calls with exactly the same size as in that
>> + * case we can simply return the same chunk. When the size differs, we fail
>> + * with assert failure or return NULL.
>> + *
>> + * We might be even support cases with (size < chunkSize). That however seems
>> + * rather pointless - Slab is meant for chunks of constant size, and moreover
>> + * realloc is usually used to enlarge the chunk.
>> + *
>> + * XXX Perhaps we should not be gentle at all and simply fails in all cases,
>> + * to eliminate the (mostly pointless) uncertainty.
>
> I think I'm in favor of that. This seems more likely to hide a bug than
> actually helpful.
>

OK.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2017-02-11 01:17:23 Re: multivariate statistics (v19)
Previous Message Tomas Vondra 2017-02-11 00:38:54 Re: Checksums by default?