Re: making use of large TLB pages

From: Neil Conway <neilc(at)samurai(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making use of large TLB pages
Date: 2002-09-28 05:30:45
Message-ID: 87ptuywuii.fsf@mailbox.samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Okay, I did some more research into this area. It looks like it will
be feasible to use large TLB pages for PostgreSQL.

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> It wasn't clear from your description whether large-TLB shmem segments
> even have IDs that one could use to determine whether "the segment still
> exists".

There are two types of hugepages:

(a) private: Not shared on fork(), not accessible to processes
other than the one that allocates the pages.

(b) shared: Shared across a fork(), accessible to other
processes: different processes can access the same segment
if they call sys_alloc_hugepages() with the same key.

So for a standalone backend, we can just use private pages (probably
worth using private hugepages rather than malloc, although I doubt it
matters much either way).

> > Another possibility might be to still allocate a small SysV shmem
> > area, and use that to provide the interlock, while we allocate the
> > buffer area using sys_alloc_hugepages. That's somewhat of a hack, but
> > I think it would resolve the interlock problem, at least.
>
> Not a bad idea ... I have not got a better one offhand ... but watch
> out for SHMMIN settings.

As it turns out, this will be completely unnecessary. Since hugepages
are an in-kernel data structure, the kernel takes care of ensuring
that dieing processes don't orphan any unused hugepage segments. The
logic works like this: (for shared hugepages)

(a) sys_alloc_hugepages() without IPC_EXCL will return a
pointer to an existing segment, if there is one that
matches the key. If an existing segment is found, the
usage counter for that segment is incremented. If no
matching segment exists, an error is returned. (I'm pretty
sure the usage counter is also incremented after a fork(),
but I'll double-check that.)

(b) sys_free_hugepages() decrements the usage counter

(c) when a process that has allocated a shared hugepage dies
for *any reason* (even kill -9), the usage counter is
decremented

(d) if the usage counter for a given segment ever reaches
zero, the segment is deleted and the memory is free'd.

If we used a key that would remain the same between runs of the
postmaster, this should ensure that there isn't a possibility of two
independant sets of backends operating on the same data dir. The most
logical way to do this IMHO would be to just hash the data dir, but I
suppose the current method of using the port number should work as
well.

To elaborate on (a) a bit, we'd want to use this logic when allocating
a new set of hugepages on postmaster startup:

(1) call sys_alloc_hugepages() without IPC_EXCL. If it returns
an error, we're in the clear: there's no page matching
that key. If it returns a pointer to a previously existing
segment, panic: it is very likely that there are some
orphaned backends still active.

(2) If the previous call didn't find anything, call
sys_alloc_hugepages() again, specifying IPC_EXCL to create
a new segment.

Now, the question is: how should this be implemented? You recently
did some of the legwork toward supporting different APIs for shared
memory / semaphores, which makes this work easier -- unfortunately,
some additional stuff is still needed. Specifically, support for
hugepages is a configuration option, that may or may not be enabled
(if it's disabled, the syscall returns a specific error). So I believe
the logic is something like:

- if compiling on a Linux system, enable support for hugepages
(the regular SysV stuff is still needed as a backup)

- if we're compiling on a Linux system but the kernel headers
don't define the syscalls we need, use some reasonable
defaults (e.g. the syscall numbers for the current hugepage
syscalls in Linux 2.5)

- at runtime, try to make one of these syscalls. If it fails,
fall back to the SysV stuff.

Does that sound reasonable?

Any other comments would be appreciated.

Cheers,

Neil

--
Neil Conway <neilc(at)samurai(dot)com> || PGP Key ID: DB3C29FC

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Clift 2002-09-28 07:08:30 How to REINDEX in high volume environments?
Previous Message Bruce Momjian 2002-09-28 05:04:11 Re: Reconstructing FKs in pg_dump