Re: PATCH: two slab-like memory allocators

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, John Gorman <johngorman2(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: two slab-like memory allocators
Date: 2017-02-11 01:33:37
Message-ID: 20170211013337.iw32bvfqwta634wy@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017-02-11 02:13:59 +0100, Tomas Vondra wrote:
> On 02/09/2017 10:37 PM, Andres Freund wrote:
> > Hi,
> >
> > On 2016-12-13 01:45:13 +0100, Tomas Vondra wrote:
> > > src/backend/utils/mmgr/Makefile | 2 +-
> > > src/backend/utils/mmgr/aset.c | 115 +--------------------------------
> > > src/backend/utils/mmgr/memdebug.c | 131 ++++++++++++++++++++++++++++++++++++++
> > > src/include/utils/memdebug.h | 22 +++++++
> > > 4 files changed, 156 insertions(+), 114 deletions(-)
> > > create mode 100644 src/backend/utils/mmgr/memdebug.c
> >
> > I'm a bit loathe to move these to a .c file - won't this likely make
> > these debugging tools even slower? Seems better to put some of them
> > them in a header as static inlines (not randomize, but the rest).
> >
>
> Do you have any numbers to support that? AFAICS compilers got really good in
> inlining stuff on their own.

Unless you use LTO, they can't inline across translation units. And
using LTO is slow enough for linking that it's not that much fun to use,
as it makes compile-edit-compile cycles essentially take as long as a
full rebuild.

> > > From 43aaabf70b979b172fd659ef4d0ef129fd78d72d Mon Sep 17 00:00:00 2001
> > > From: Tomas Vondra <tomas(at)2ndquadrant(dot)com>
> > > Date: Wed, 30 Nov 2016 15:36:23 +0100
> > > Subject: [PATCH 2/3] slab allocator
> > >
> > > ---
> > > src/backend/replication/logical/reorderbuffer.c | 82 +--
> > > src/backend/utils/mmgr/Makefile | 2 +-
> > > src/backend/utils/mmgr/slab.c | 803 ++++++++++++++++++++++++
> > > src/include/nodes/memnodes.h | 2 +-
> > > src/include/nodes/nodes.h | 1 +
> > > src/include/replication/reorderbuffer.h | 15 +-
> > > src/include/utils/memutils.h | 9 +
> >
> > I'd like to see the reorderbuffer changes split into a separate commit
> > from the slab allocator introduction.
> >
>
> I rather dislike patches that only add a bunch of code, without actually
> using it anywhere.

> But if needed, this is trivial to do at commit time - just don't
> commit the reorderbuffer bits.

Meh.

> > > + * Each block includes a simple bitmap tracking which chunks are used/free.
> > > + * This makes it trivial to check if all chunks on the block are free, and
> > > + * eventually free the whole block (which is almost impossible with a global
> > > + * freelist of chunks, storing chunks from all blocks).
> >
> > Why is checking a potentially somewhat long-ish bitmap better than a
> > simple counter, or a "linked list" of "next free chunk-number" or such
> > (where free chunks simply contain the id of the subsequent chunk)?
> > Using a list instead of a bitmap would also make it possible to get
> > 'lifo' behaviour, which is good for cache efficiency. A simple
> > chunk-number based singly linked list would only imply a minimum
> > allocation size of 4 - that seems perfectly reasonable?
> >
>
> A block-level counter would be enough to decide if all chunks on the block
> are free, but it's not sufficient to identify which chunks are free /
> available for reuse.
>
> The bitmap only has a single bit per chunk, so I find "potentially long-ish"
> is a bit misleading. Any linked list implementation will require much more
> per-chunk overhead - as the chunks are fixed-legth, it's possible to use
> chunk index (instead of 64-bit pointers), to save some space. But with large
> blocks / small chunks that's still at least 2 or 4 bytes per index, and
> you'll need two (to implement doubly-linked list, to make add/remove
> efficient).

> For example assume 8kB block and 64B chunks, i.e. 128 chunks. With bitmap
> that's 16B to track all free space on the block. Doubly linked list would
> require 1B per chunk index, 2 indexes per chunk. That's 128*2 = 256B.

> I have a hard time believing this the cache efficiency of linked lists
> (which may or may not be real in this case) out-weights this, but if you
> want to try, be my guest.

I'm not following - why would there be overhead in anything for
allocations bigger than 4 (or maybe 8) bytes? You can store the list
(via chunk ids, not pointers) inside the chunks itself, where data
otherwise would be. And I don't see why you'd need a doubly linked
list, as the only operations that are needed are to push to the front of
the list, and to pop from the front of the list - and both operations
are simple to do with a singly linked list?

> > Thirdly, isn't that approach going to result in a quite long freelists
> > array, when you have small items and a decent blocksize? That seems like
> > a fairly reasonable thing to do?
> >
>
> I'm confused. Why wouldn't that be reasonable. Or rather, what would be a
> more reasonable way?

If I understood correctly, you have one an array of doubly linked lists.
A block is stored in the list at the index #block's-free-elements. Is that
right?

If so, if you have e.g. 8 byte allocations and 64kb sized blocks, you
end up with an array of 1024 doubly linked lists, which'll take up 64kb
on its own. And there a certainly scenarios where even bigger block
sizes could make sense. That's both memory overhead, and runtime
overhead, because at reset-time we'll have to check the whole array
(which'll presumably largely be empty lists). Resetting is a pretty
common path...

> > > + /*
> > > + * We need to update index of the next free chunk on the block. If we used
> > > + * the last free chunk on this block, set it to chunksPerBlock (which is
> > > + * not a valid chunk index). Otherwise look for the next chunk - we know
> > > + * that it has to be above the current firstFreeChunk value, thanks to how
> > > + * we maintain firstFreeChunk here and in SlabFree().
> > > + */
> > > + if (block->nfree == 0)
> > > + block->firstFreeChunk = set->chunksPerBlock;
> > > + else
> > > + {
> > > + /* look for the next free chunk in the block, after the first one */
> > > + while ((++block->firstFreeChunk) < set->chunksPerBlock)
> > > + {
> > > + int byte = block->firstFreeChunk / 8;
> > > + int bit = block->firstFreeChunk % 8;
> > > +
> > > + /* stop when you find 0 (unused chunk) */
> > > + if (!(block->bitmapptr[byte] & (0x01 << bit)))
> > > + break;
> > > + }
> > > +
> > > + /* must have found the free chunk */
> > > + Assert(block->firstFreeChunk != set->chunksPerBlock);
> > > + }
> >
> > This and previous code just re-affirms my opinion that a bitmap is not
> > the best structure here.
> >
>
> It'd be great if you could explain why, instead of just making such claims
> ...

Because it's complicated. This is a fair bit of code and branches to
run in a pretty hot path.

- Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2017-02-11 01:39:55 Re: amcheck (B-Tree integrity checking tool)
Previous Message Tomas Vondra 2017-02-11 01:17:23 Re: multivariate statistics (v19)