Re: Reducing the chunk header sizes on all memory context types

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Reducing the chunk header sizes on all memory context types
Date: 2022-10-06 23:10:33
Message-ID: 2983672.1665097833@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> FreeBSD 13.0, arm64: Usually the low-order nibble is 0000 or 1111,
> but for some smaller values of N it sometimes comes up as 0010.
> NetBSD 9.2, amd64: results similar to FreeBSD.

I looked into NetBSD's malloc.c, and what I discovered is that
their implementation doesn't have any chunk headers: chunks of
the same size are allocated consecutively within pages, and all
the bookkeeping data is somewhere else. Presumably FreeBSD is
the same. So the apparent special case with 0010 is an illusion,
even though I saw it on two different machines (maybe it's a
specific value that we're allocating?). The most likely case
is 0000 due to the immediately previous word having never been
used (note that like palloc, they round chunk sizes up to powers
of two, so unused space at the end of a chunk is common). I'm
not sure whether the cases I saw with 1111 are chance artifacts
or reflect some real mechanism, but probably the former. I
thought for a bit that that might be the effects of wipe_mem
on the previous chunk, but palloc'd storage would never share
the same page as malloc'd storage under this allocator, because
we grab it from malloc in larger-than-page chunks.
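
As a rough illustration of why unused space at the end of a chunk is common under power-of-two rounding, here is a minimal sketch. The helper names `round_up_pow2` and `chunk_slack` and the 16-byte minimum chunk size are assumptions made for the example; this is not NetBSD's or palloc's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Round a request up to the next power of two.  Hypothetical helper:
 * the 16-byte minimum is an assumption for illustration only.  The
 * sketch assumes request <= SIZE_MAX / 2 + 1; the second loop test
 * merely prevents the shift from overflowing for larger values.
 */
size_t
round_up_pow2(size_t request)
{
    size_t      size = 16;

    while (size < request && size <= SIZE_MAX / 2)
        size <<= 1;
    return size;
}

/*
 * Bytes left unused at the end of the chunk.  If that tail is never
 * written, a word read from it is likely still zero.
 */
size_t
chunk_slack(size_t request)
{
    return round_up_pow2(request) - request;
}
```

For example, a 100-byte request lands in a 128-byte chunk and leaves 28 bytes untouched at the end, while a request of exactly 1024 bytes leaves none.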

However ... after looking into glibc's malloc.c, I find that
it does use a chunk header, and very conveniently the three bits
that we care about are flag bits (at least on 64-bit machines):

chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |             Size of previous chunk, if unallocated (P clear)  |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |             Size of chunk, in bytes                     |A|M|P|
  mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |             User data starts here...                          .

The A bit is only used when threading, and hence should always
be zero in our usage. The M bit only gets set in chunks large
enough to be separately mmap'd, so when it is set P must be 0.
If M is not set then P seems to usually be 1, although it could
be 0. So the three possibilities for what we can see under
glibc are 000, 001, 010 (the last only occurring for chunks
larger than 128K). This squares with experimental results on
my machine --- I'd not thought to try sizes above 100K before.
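
The reasoning above can be written down as a small check on the three low-order bits of the size word. This is only a sketch: the flag macros use the names and values found in glibc's malloc.c, but `chunk_flag_bits` and `flags_expected_single_threaded` are hypothetical helpers for illustration, and nothing here reads real malloc memory.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag bits kept in the low-order bits of glibc's chunk size word. */
#define PREV_INUSE      ((size_t) 0x1)  /* P: previous chunk is in use */
#define IS_MMAPPED      ((size_t) 0x2)  /* M: chunk was obtained via mmap */
#define NON_MAIN_ARENA  ((size_t) 0x4)  /* A: chunk is in a non-main arena */
#define FLAG_MASK       (PREV_INUSE | IS_MMAPPED | NON_MAIN_ARENA)

/* Extract the A|M|P bits from a size word (hypothetical helper). */
size_t
chunk_flag_bits(size_t size_word)
{
    return size_word & FLAG_MASK;
}

/*
 * Is this bit pattern one of those expected in a single-threaded
 * process, i.e. 000, 001 or 010?  A must be clear, and M and P
 * must not both be set.
 */
bool
flags_expected_single_threaded(size_t size_word)
{
    size_t      flags = chunk_flag_bits(size_word);

    if (flags & NON_MAIN_ARENA)
        return false;
    if ((flags & IS_MMAPPED) && (flags & PREV_INUSE))
        return false;
    return true;
}
```

Under these assumptions a size word of 0x21 (a 32-byte chunk with P set) yields 001 and passes the check, while any pattern with the A bit set, or 011, does not.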

So I'm still inclined to leave 001 and 010 both unused, but the
reason why is different than I thought before.

Going forward, we could commandeer 010 if we need to without losing
very much debuggability, since malloc'ing more than 128K in a chunk
won't happen often.

regards, tom lane
