Re: Dynamic shared memory areas

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dynamic shared memory areas
Date: 2016-12-01 11:33:50
Message-ID: CAEepm=3U7+Ro7=ECeQuAZoeFXs8iDVX56NXGCV7z3=+H+Wd0Sw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 30, 2016 at 4:35 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> More review:

Thanks!

> + * For large objects, we just stick all of the allocations in fullness class
> + * 0. Since we can just return the space directly to the free page manager,
> + * we don't really need them on a list at all, except that if someone wants
> + * to bulk release everything allocated using this BlockAreaContext, we
> + * have no other way of finding them.
>
> This comment is out-of-date.

Removed.

> + /*
> + * If this is the only span, and there is no active
> span, then maybe
> + * we should probably move this span to fullness class
> 1. (Otherwise
> + * if you allocate exactly all the objects in the only
> span, it moves
> + * to class 3, then you free them all, it moves to 2,
> and then is
> + * given back, leaving no active span).
> + */
>
> "maybe we should probably" seems to have one more doubt-expressing
> word than it needs.

Fixed.

> + if (size_class == DSA_SCLASS_SPAN_LARGE)
> + /* Large object frees give back segments
> aggressively already. */
> + continue;
>
> We generally use braces in this kind of case.

Fixed.

> + * Search the fullness class 1 only. That is where we
> expect to find
>
> extra "the"

Fixed.

> + /* Call for effect (we don't need the result). */
> + get_segment_by_index(area, index);
> ...
> + return area->segment_maps[index].mapped_address + offset;
>
> It isn't guaranteed that area->segment_maps[index].mapped_address will
> be non-NULL on return from get_segment_by_index, and then this
> function will return a completely bogus pointer to the caller. I
> think you should probably elog() instead.

Hmm.  Right.  In fact it's never OK to ask for a segment by index when
that segment is gone, since that implies access-after-free, so there
is no reason for callers to handle NULL.  I have changed
get_segment_by_index to raise an error.

> + elog(ERROR, "dsa: can't attach to area handle %u", handle);
>
> Avoid ":" in elog messages. You don't really need to - and it isn't
> project style to - tag these with "dsa:"; that's what \errverbose or
> \set VERBOSITY verbose is for. In this particular case, I might just
> adopt the formulation from parallel.c:
>
> ereport(ERROR,
>
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> errmsg("could not map dynamic shared
> memory segment")));

Fixed.

> + elog(FATAL,
> + "dsa couldn't find run of pages:
> fpm_largest out of sync");
>
> Here I'd go with "dsa could not find %u free pages".

Fixed.

> + elog(ERROR, "dsa_pin: area already pinned");
>
> "dsa_area already pinned"

Fixed.

> + elog(ERROR, "dsa_unpin: area not pinned");
>
> "dsa_area not pinned"

Fixed.

> + if (segment == NULL)
> + elog(ERROR, "dsa: can't attach to segment");
>
> As above.

Fixed.

> +static dsa_segment_map *
> +get_segment_by_index(dsa_area *area, dsa_segment_index index)
> +{
> + if (unlikely(area->segment_maps[index].mapped_address == NULL))
> + {
> + dsm_handle handle;
> + dsm_segment *segment;
> + dsa_segment_map *segment_map;
> +
> + handle = area->control->segment_handles[index];
>
> Don't you need to acquire the lock for this?

No. I've updated the comments to explain, and refactored a bit.

I'll explain in different words here: this is memory, you are a
C programmer, and as with malloc/free, referencing memory that has
been freed invokes undefined behaviour possibly including but not
limited to demons flying out of your nose. When you call
dsa_get_address(some_dsa_pointer) or dsa_free(some_dsa_pointer) you
are asserting that the address points to memory allocated with
dsa_allocate from this area that has not yet been freed. Given that
assertion, area->control->segment_handles[index] (where index is
extracted from the address) must be valid and cannot change under your
feet. control->segment_handles[index] can only change after
everything allocated from that whole segment has been freed; you can
think of it as 'locked' as long as any live object exists in the
segment it corresponds to.

In general I'm trying not to do anything too clever in the first
version of DSA: it uses plain old LWLock for each size-class's pool
and then an area-wide LWLock for segment operations. But in the
particular case of dsa_get_address, I think it's really important for
the viability of DSA that these address translations be fast in the
likely path, hence my desire to figure out a protocol for lock-free
address translation even though segments come and go.

> + /* Check all currently mapped segments to find what's
> been freed. */
> + for (i = 0; i <= area->high_segment_index; ++i)
> + {
> + if (area->segment_maps[i].header != NULL &&
> + area->segment_maps[i].header->freed)
> + {
> + dsm_detach(area->segment_maps[i].segment);
> + area->segment_maps[i].segment = NULL;
> + area->segment_maps[i].header = NULL;
> + area->segment_maps[i].mapped_address = NULL;
> + }
> + }
> + area->freed_segment_counter = freed_segment_counter;
>
> And this?

Hmm. I had a theory for why that didn't need to be locked, though it
admittedly lacked a necessary barrier -- d'oh. But I'll spare you the
details and just lock it because this is not a hot path and it's much
simpler that way.

I've also refactored that code into a new static function
check_for_freed_segments, because I realised that dsa_free needs the
same treatment as dsa_get_address. The checking for freed segments
was also happening at the wrong time, which I've now straightened out
-- it must happen before you arrive in get_segment_by_index, as
described in the copious new comments.  Thoughts?

> +/*
> + * Release a DSA area that was produced by dsa_create_in_place or
> + * dsa_attach_in_place. It is preferable to use one of the 'dsa_on_XXX'
> + * callbacks so that this is managed automatically, because failure to release
> + * an area created in-place leaks its segments permanently.
> + */
> +void
> +dsa_release_in_place(void *place)
> +{
> + decrement_reference_count((dsa_area_control *) place);
> +}
>
> Since this seems to be the only caller of decrement_reference_count,
> you could just put the logic here.

Ok, done.

> The contract for this function is
> also a bit unclear from the header comment. I initially thought that
> it was your intention that this should be called from every process
> that has either created or attached the segment.

That is indeed my intention.

> But that doesn't
> seem like it will work, because decrement_reference_count calls
> dsm_unpin_segment on every segment, and a segment can only be pinned
> once, so everybody else would fail. So maybe the idea is that ANY ONE
> process has to call dsa_release_in_place. But then that could lead to
> failures in other backends inside get_segment_by_index(), because
> segments they don't have mapped might already be gone. OK, third try:
> maybe the idea is that the LAST process out has to call
> dsa_release_in_place(). But how do the various cooperating processes
> know which one that is?

It decrements the reference count for the area, but only unpins the
segments if the reference count reaches zero:

    Assert(control->refcnt > 0);
    if (--control->refcnt == 0)
    {
        /* ... unpin all the segments ... */
    }

> I've also realized another thing that's not so good about this:
> superblocks are 64kB, so allocating 64kB of initial space probably
> just wastes most of it. I think we want to either allocate just
> enough space to hold the control information, or else that much space
> plus space for at least a few superblocks. I'm inclined to go the
> first way, because it seems a bit overenthusiastic to allocate 256kB
> or 512kB just on the off chance we might need it. On the other hand,
> including a few bytes in the control segment so that we don't need to
> allocate 1MB segment that we might not need sounds pretty sharp.
> Maybe DSA can expose an API that returns the number of bytes that will
> be needed for the control structure, and then the caller can arrange
> for that space to be available during the Estimate phase...

Yeah, I also thought about that, but didn't try to do better before
because I couldn't see how to make a nice macro for this without
dragging a ton of internal stuff out into the header. I have written
a new function dsa_minimum_size(). The caller can use that number
directly to get a minimal in-place area that will immediately create
an extra DSM segment as soon as you call dsa_allocate. Unfortunately
you can't really add more to that number with predictable results
unless you know some internal details and your future allocation
pattern: to avoid extra segment creation, you'd need to add 4KB for a
block of spans and then 64KB for each size class you plan to allocate,
and of course that might change. But at least it allows us to create
an in-place DSA area for every parallel query cheaply, and then defer
creation of the first DSM segment until the first time someone tries
to allocate, which seems about right to me.

And in response to your earlier email:

On Tue, Nov 29, 2016 at 7:48 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> shm_mq_attach() made the opposite decision about how to solve this
> problem, and frankly I think that API is a lot more convenient: if the
> first argument to shm_mq_attach() happens to be located inside of a
> DSM, you can pass the DSM as the second argument and it registers the
> on_dsm_detach() hook for you. If not, you can pass NULL and deal with
> it in some other way. But this makes the common case very simple.

Ok, I've now done the same.

I feel like some more general destructor callback for objects in a
containing object is wanted here, rather than sticking dsm_segment *
into various constructor-like functions, but I haven't thought
seriously about that and I'm not arguing that case now.

Please find attached dsa-v8.patch, and also a small test module for
running random allocate/free exercises and dumping the internal
allocator state.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
test-dsa.patch application/octet-stream 11.6 KB
dsa-v8.patch application/octet-stream 141.2 KB
