Re: same-address mappings vs. relative pointers

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 12:44:27
Message-ID: CA+Tgmoanps9DAmVb7XUNP5nrV01A_PzpvsLSjJKnGG_Azx2Zog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 4:56 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi Robert,
>
> On 2013-12-04 23:32:27 -0500, Robert Haas wrote:
>> But I'm also learning painfully that this kind of thing only goes so
>> far. For example, I spent some time looking at what it would take to
>> provide a dynamic shared memory equivalent of palloc/pfree, a facility
>> that I feel fairly sure would attract a few fans.
>
> I have to admit, I don't fully see the point in a real palloc()
> equivalent for shared memory. If you're allocating shared memory in a
> granular fashion at somewhat high frequency you're doing something
> *wrong*. And the result is not going to be scalable.

Why? Lots of people have written lots of programs that do just that.
I agree it's overdone, but saying it should never happen seems like an
over-rotation in the opposite direction.

In any case, the application that led me to this was actually parallel
internal sort. Let's suppose we had an implementation of parallel
internal sort. You'd load all the tuples you want to sort into
dynamic shared memory (just as we now load them into palloc'd chunks),
start a bunch of workers, and the original backend and the
newly-started workers would cooperate to sort your data. Win.

It's fairly clear that, technically speaking, you do NOT need a real
allocator for this. You can just shove the first tuple that you store
into shared memory, say right up against the end of the region. Then
you can place the next tuple immediately before it and the following
one just before that, and so on. This is very simple to implement and
also extremely memory-efficient, so superficially it's quite an
appealing approach.

But now let's suppose the input data was estimated to fit in work_mem
but in the end did not, and therefore we need to instead do an
external sort. That means we're going to start evicting tuples from
memory and to make room for new tuples we read from the input stream.
Well, now our strategy of tight-packing everything does not look so
good, because we have no way of tracking which tuples we no longer
need and reusing that space for new tuples. If external sort is not
something we know how to do in parallel, we can potentially copy all
of the tuples out of the dynamic shared memory segment and back to
backend-private memory in palloc'd chunks, and then deallocate the
dynamic shared memory segment... but that's potentially expensive if
work_mem is large, and it temporarily uses double what we're allowed
by work_mem.

And suppose we want to do a parallel external sort with, say, the
worker backend pushing tuples into a heap stored in the dynamic shared
memory segment and other backends writing them out to tapes and
removing them from the heap. You can't do that without some kind of
space management, i.e. an allocator.

> This is one of those times where live really, really would be easier if
> we were using c++. Just have small smart-pointer wrapper, and we
> wouldn't need to worry too much.

Yes, a template class would be perfect here.

>> I still have mixed feelings about the idea of same-address mappings.
>> On the one hand, on 64-bit architectures, there's a huge amount of
>> address space available. Assuming the OS does something even vaguely
>> sensible in terms of laying out the text, heap, stack, and shared
>> library mappings, there ought to be many petabytes of address space
>> that never gets touched, and it's hard to see why we couldn't find
>> some place in there to stick our stuff.
>
> It's actually quite a bit away from petabytes. Most x86-64 CPUs have
> 48bit of virtual memory, with the OS eating up a far bit of that (so it
> doesn't need to tear down TLBs in system calls). I think cross-platform
> you end up with something around 8TB, up to 64TB on others OSs.
> Now, there are some CPUs coming out with bigger virtual memory sizes,
> but it's going to be longer than I am willing to wait till we are there.

Oh. Well, that's a shame.

>> Any thoughts on what the least painful compromise is here?
>
> I think a reasonable route is having some kind of smart-pointer on C
> level that abstracts away the offset math and allows to use pointers
> locally. Something like void *sptr_deref(sptr *); where the returned
> pointer can be used as long as it is purely in memory. And
> sptr_update(sptr *, void *); which allows a sptr to point elsewhere in
> the same segment.
> + lots of smarts

Can you elaborate on this a little bit? I'm not sure I understand
what you're getting at here.

> I continue to believe that a) using pointers in dynamically allocated
> segments is going to end up with lots of pain. b) the pain from not
> having real pointers is manageable.

Fair opinion, but I think we will certainly need to pass around memory
offsets in some form for certain things. Even for purely internal
parallel sort, the Tuplesort objects need to store the offset of the
actual tuple within the segment. My first guess as to how to do that
was to assume that sizeof(Size) == sizeof(void *) and just store
offsets as a Size, but then I thought maybe it'd be better to do
something like typedef Size relptr. And then I thought, boy, it sucks
not to be able to declare what kind of a thing we're pointing *at*
here, but apart from using C++ I see no solution to that problem. I
guess we could do something like this:

#define relptr(type) Size

So the compiler wouldn't enforce anything, but at least notationally
we'd know what sort of object we were supposedly referencing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message MauMau 2013-12-05 13:00:41 Re: [bug fix] pg_ctl fails with config-only directory
Previous Message Andres Freund 2013-12-05 12:42:21 Re: tracking commit timestamps