EXPERIMENTAL: mmap-based memory context / allocator

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: EXPERIMENTAL: mmap-based memory context / allocator
Date: 2015-02-15 18:57:40
Message-ID: 54E0EC24.3010708@2ndquadrant.com
Lists: pgsql-hackers

Hi!

While playing with the memory context internals some time ago, I started
wondering whether there are better ways to allocate memory from the kernel -
either tweaking libc somehow, or maybe interacting with the kernel directly.

I mostly forgot about that topic, but after the local conference last
week we went to a pub and one of the things we discussed over a beer was
how complex and unintuitive the memory stuff is, because of the libc
heuristics, 'sbrk' properties [1] and behavior in the presence of holes,
OOM etc.

The virtual memory system should handle this to a large degree, but I've
repeatedly run into problems when that was not the case (for example the
memory is still part of the virtual address space, and thus counted by OOM).

One of the wilder ideas (I mentioned beer was involved!) was a memory
allocator based on mmap [2], bypassing the libc malloc implementation
altogether. mmap() has some nice features (e.g. no issues with returning
memory back to the kernel, which may be a problem with sbrk). So I hacked
a bit and switched the AllocSet implementation to mmap().

And it works surprisingly well, so here is an experimental patch for
comments on whether this really is a good idea etc. Some parts of the patch
are a bit dirty and I've only tested it on x86.

Notes
-----

(1) The main changes are mostly trivial, rewriting malloc()/free() to
mmap()/munmap(), plus related changes (e.g. mmap() returns (void*)-1
instead of NULL in case of failure).

(2) A significant difference is that mmap() can't allocate blocks
smaller than page size, which is 4kB on x86. This means the
smallest context is 4kB (instead of 1kB), and also affects the
growth of block size (it can only grow to 8kB). So this changes
the AllocSet internal behavior a bit.

(3) As this bypasses libc, it can't use the libc freelists (which are
used by malloc). To compensate for this, there's a simple
process-level freelist of blocks, shared by all memory contexts.
This cache has a limited capacity (roughly 4MB per).

(4) Some of the comments are obsolete, still referencing malloc/free.
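
A minimal sketch of what notes (1)-(3) describe, not code taken from the
patch: requests are rounded up to whole pages, mmap() failure is detected via
MAP_FAILED (the (void*)-1 value) rather than NULL, and freed blocks go to a
small process-level cache before falling back to munmap(). All names here
(block_alloc, block_free, BLOCK_CACHE_MAX_BYTES, CachedBlock) are hypothetical,
and the exact-size lookup is just one simple policy assumed for illustration.

```c
#define _DEFAULT_SOURCE			/* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical cap on cached bytes, mirroring the "roughly 4MB" in note (3). */
#define BLOCK_CACHE_MAX_BYTES ((size_t) 4 * 1024 * 1024)

/* A cached block stores its own list link and size in its first bytes. */
typedef struct CachedBlock
{
	struct CachedBlock *next;
	size_t		size;
} CachedBlock;

static CachedBlock *block_cache = NULL;
static size_t block_cache_bytes = 0;

/* Round a request up to a whole number of pages (4kB on x86). */
static size_t
round_up_to_page(size_t size)
{
	size_t		page = (size_t) sysconf(_SC_PAGESIZE);

	return (size + page - 1) / page * page;
}

/*
 * Allocate a block of at least *size_p bytes. On success *size_p is updated
 * to the actual (page-rounded) size. Returns NULL on failure, so callers
 * written for malloc() need no changes.
 */
static void *
block_alloc(size_t *size_p)
{
	size_t		size = round_up_to_page(*size_p);
	CachedBlock **link;
	void	   *ptr;

	assert(*size_p > 0);

	/* Reuse a cached block of exactly this size, if there is one. */
	for (link = &block_cache; *link != NULL; link = &(*link)->next)
	{
		if ((*link)->size == size)
		{
			CachedBlock *hit = *link;

			*link = hit->next;
			block_cache_bytes -= size;
			*size_p = size;
			return hit;
		}
	}

	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ptr == MAP_FAILED)		/* (void *) -1, not NULL */
		return NULL;

	*size_p = size;
	return ptr;
}

/* Cache the block if there is room, otherwise return it to the kernel. */
static void
block_free(void *ptr, size_t size)
{
	if (block_cache_bytes + size <= BLOCK_CACHE_MAX_BYTES)
	{
		CachedBlock *blk = ptr;

		blk->size = size;
		blk->next = block_cache;
		block_cache = blk;
		block_cache_bytes += size;
		return;
	}
	munmap(ptr, size);
}
```

A caller would pass the requested size by pointer, use the rounded size that
comes back, and hand the same size to block_free() later, since munmap()
needs the mapping length.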

Benchmarks
----------

I've done extensive testing and also benchmarking, and it seems to be no
slower than the current implementation, and in some cases is actually a
bit faster.

a) time pgbench -i -s 300

- pgbench initialisation, measuring the COPY and the total duration.
- averages of 3 runs (negligible variations between runs)

            COPY        total
---------------------------------
master      0:26.22     1:22
mmap        0:26.35     1:22

Pretty much no difference.

b) pgbench -S -c 8 -j 8 -T 60

- short read-only runs (1 minute) after a warmup
- average, min, max of 8 runs

            average     min       max
-------------------------------------
master      96785       95329     97981
mmap        98279       97259     99671

That's a rather consistent 1-2% improvement.

c) REINDEX pgbench_accounts_pkey

- large maintenance_work_mem so that it's in-memory sort
- average, min, max of 8 runs (duration in seconds)

            average     min       max
-------------------------------------
master      10.35       9.64      13.56
mmap        9.85        9.81      9.90

Again, mostly an improvement, except for the minimum where the current
memory context was a bit faster. But overall the mmap-based one is
much more consistent.

Some of the differences may be due to allocating 4kB blocks from the
very start (while the current allocator starts with 1kB, then 2kB and
finally 4kB).
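
To make the block-size point concrete, here is a rough sketch of the doubling
growth, assuming block sizes simply double up to a cap and ignoring block
header overhead. The helper names (next_block_size, blocks_needed) are
hypothetical and are not part of the patch or of aset.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: double the block size, capped at max_size. */
static size_t
next_block_size(size_t current, size_t max_size)
{
	size_t		next = current * 2;

	return (next > max_size) ? max_size : next;
}

/*
 * Count how many blocks have to be allocated before the context has at
 * least `needed` bytes of total block space, starting from init_size.
 */
static int
blocks_needed(size_t init_size, size_t max_size, size_t needed)
{
	size_t		total = 0;
	size_t		size = init_size;
	int			nblocks = 0;

	assert(init_size > 0);		/* otherwise the loop would never end */

	while (total < needed)
	{
		total += size;
		nblocks++;
		size = next_block_size(size, max_size);
	}
	return nblocks;
}
```

Under these assumptions, a context that starts at 1kB needs three block
allocations (1kB, 2kB, 4kB) before it holds 4kB, while one that starts at 4kB
needs a single allocation, which is one possible source of the differences
above.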

Ideas, opinions?

[1] http://linux.die.net/man/2/sbrk
[2] http://linux.die.net/man/2/mmap

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
mmap-allocator-v1-wip.patch text/x-diff 13.8 KB
