Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-08 10:36:49
Message-ID: CAEepm=1gF0q04RhgXUzobZeXYqna9eHDLEY5YZkLx_yhmPxbHA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 8, 2017 at 8:40 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> Does anyone have any suggestions on how to tackle this?
>
> Hmm. One approach might be like this:
>
> [hand-wavy stuff]

Thinking a bit harder about this, I suppose there could be a kind of
object called a SharedBufFileManager (insert better name) which you
can store in a DSM segment. The leader backend that initialises a DSM
segment containing one of these would then call a constructor function
that sets an internal refcount to 1 and registers an on_dsm_detach
callback for its on-detach function. All worker backends that attach
to the DSM segment would need to call an attach function for the
SharedBufFileManager to increment a refcount and also register the
on_dsm_detach callback, before any chance that an error might be
thrown (is that difficult?); failure to do so could result in file
leaks. Then, when a BufFile is to be shared (AKA exported, made
unifiable), a SharedBufFile object can be initialised somewhere in the
same DSM segment and registered with the SharedBufFileManager.
Internally all registered SharedBufFile objects would be linked
together using offsets from the start of the DSM segment for link
pointers. Now when SharedBufFileManager's on-detach function runs, it
decrements the refcount in the SharedBufFileManager, and if that
reaches zero then it runs a destructor that spins through the list of
SharedBufFile objects deleting files that haven't already been deleted
explicitly.
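
To make the shape of that proposal concrete, here is a minimal sketch in plain C. It is not PostgreSQL code: a calloc'd buffer stands in for the DSM segment, "deleting" a file just sets a flag, and every function name (manager_init, manager_attach, manager_register_file, manager_detach) is invented for illustration. It shows only the three ideas from the paragraph above: a refcount that starts at 1 in the leader, registered files linked by offsets from the start of the segment, and a destructor that runs when the last reference goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; a calloc'd buffer stands in for a DSM segment. */
#define SEGMENT_SIZE 4096
#define NO_FILE ((size_t) 0)    /* offset 0 holds the manager, so it is never a file */

typedef struct SharedBufFileManager
{
    int     refcount;       /* attached backends; real code would need a lock or atomics */
    size_t  first_file;     /* offset of the first registered file, or NO_FILE */
    size_t  next_free;      /* bump-allocation cursor within the segment */
} SharedBufFileManager;

typedef struct SharedBufFile
{
    size_t  next;           /* offset of the next registered file, or NO_FILE */
    int     deleted;        /* set once the underlying file has been removed */
    char    name[64];
} SharedBufFile;

/* Leader: construct the manager at the start of the segment with refcount 1. */
static void
manager_init(void *base)
{
    SharedBufFileManager *mgr = base;

    mgr->refcount = 1;
    mgr->first_file = NO_FILE;
    mgr->next_free = sizeof(SharedBufFileManager);
}

/* Worker: take a reference (real code would also register its detach callback here). */
static void
manager_attach(void *base)
{
    SharedBufFileManager *mgr = base;

    assert(mgr->refcount > 0);
    mgr->refcount++;
}

/* Carve a SharedBufFile out of the segment and push it onto the offset-linked list. */
static size_t
manager_register_file(void *base, const char *name)
{
    SharedBufFileManager *mgr = base;
    size_t      off = mgr->next_free;
    SharedBufFile *file;

    if (off + sizeof(SharedBufFile) > SEGMENT_SIZE)
        return NO_FILE;         /* the fixed-size segment is full */
    file = (SharedBufFile *) ((char *) base + off);
    file->next = mgr->first_file;
    file->deleted = 0;
    strncpy(file->name, name, sizeof(file->name) - 1);
    file->name[sizeof(file->name) - 1] = '\0';
    mgr->first_file = off;
    mgr->next_free = off + sizeof(SharedBufFile);
    return off;
}

/*
 * On-detach function: drop one reference.  Returns -1 while other backends
 * remain attached; otherwise runs the destructor and returns the number of
 * files it had to delete.
 */
static int
manager_detach(void *base)
{
    SharedBufFileManager *mgr = base;
    int         ndeleted = 0;
    size_t      off;

    assert(mgr->refcount > 0);
    if (--mgr->refcount > 0)
        return -1;
    for (off = mgr->first_file; off != NO_FILE;)
    {
        SharedBufFile *file = (SharedBufFile *) ((char *) base + off);

        if (!file->deleted)
        {
            file->deleted = 1;  /* stand-in for unlinking file->name */
            ndeleted++;
        }
        off = file->next;
    }
    return ndeleted;
}
```

Offsets rather than raw pointers are used for the links because a DSM segment may be mapped at a different address in each backend.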

I retract the pin/unpin and per-file refcounting stuff I mentioned
earlier. You could make the default that all files registered with a
SharedBufFileManager survive until the containing DSM segment is
detached everywhere using that single refcount in the
SharedBufFileManager object, but also provide a 'no really delete this
particular shared file now' operation for client code that knows it's
safe to do that sooner (which would be the case for me, I think). I
don't think per-file refcounts are needed.
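
A rough sketch of that simplified policy, again in stand-alone C with invented names (FileManager, fm_delete_now and so on are assumptions, not an existing API): the manager carries the only refcount, files carry just a "deleted" flag, and client code may delete a particular file early. Whatever is left is cleaned up at the final detach.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FILES 8             /* arbitrary capacity for this illustration */

typedef struct ManagedFile
{
    bool        deleted;        /* no per-file refcount, only this flag */
} ManagedFile;

typedef struct FileManager
{
    int         refcount;       /* the single refcount: one per attached backend */
    int         nfiles;
    ManagedFile files[MAX_FILES];
} FileManager;

/* Leader initialises the manager holding the first reference. */
static void
fm_init(FileManager *fm)
{
    fm->refcount = 1;
    fm->nfiles = 0;
}

/* Each worker takes one reference when it attaches. */
static void
fm_attach(FileManager *fm)
{
    assert(fm->refcount > 0);
    fm->refcount++;
}

/* Register a file; returns its index, or -1 if the manager is full. */
static int
fm_register(FileManager *fm)
{
    if (fm->nfiles >= MAX_FILES)
        return -1;
    fm->files[fm->nfiles].deleted = false;
    return fm->nfiles++;
}

/*
 * "No really, delete this particular shared file now."  Returns true if this
 * call deleted the file, false if it was already gone.
 */
static bool
fm_delete_now(FileManager *fm, int idx)
{
    assert(idx >= 0 && idx < fm->nfiles);
    if (fm->files[idx].deleted)
        return false;
    fm->files[idx].deleted = true;      /* stand-in for removing the file */
    return true;
}

/*
 * Drop one reference.  Returns -1 while other backends remain attached;
 * on the final detach, deletes every surviving file and returns the count.
 */
static int
fm_detach(FileManager *fm)
{
    int         ndeleted = 0;

    assert(fm->refcount > 0);
    if (--fm->refcount > 0)
        return -1;
    for (int i = 0; i < fm->nfiles; i++)
    {
        if (fm_delete_now(fm, i))
            ndeleted++;
    }
    return ndeleted;
}
```

The early-delete path is only safe when the caller knows no other backend still needs the file; nothing in this sketch enforces that.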

There are a couple of problems with the above though. Firstly, doing
reference counting in DSM segment on-detach hooks is really a way to
figure out when the DSM segment is about to be destroyed by keeping a
separate refcount in sync with the DSM segment's refcount, but it
doesn't account for pinned DSM segments. It's not your use-case or
mine currently, but someone might want a DSM segment to live even when
it's not attached anywhere, to be reattached later. If we're trying
to use DSM segment lifetime as a scope, we'd be ignoring this detail.
Perhaps instead of adding our own refcount we need a new kind of hook
on_dsm_destroy. Secondly, I might not want to be constrained by a
fixed-sized DSM segment to hold my SharedBufFile objects... there are
cases where I need to share a number of batch files that is unknown
at the start of execution time when the DSM segment is sized (I'll
write about that shortly on the Parallel Shared Hash thread). Maybe I
can find a way to get rid of that requirement. Or maybe it could
support DSA memory too, but I don't think it's possible to use
on_dsm_detach-based cleanup routines that refer to DSA memory because
by the time any given DSM segment's detach hook runs, there's no
telling which other DSM segments have been detached already, so the
DSA area may already have partially vanished; some other kind of hook
that runs earlier would be needed...
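
The first problem can be illustrated with a toy model. This is only a simulation under stated assumptions: ToySegment and its functions are invented, the "destroy hook" models the on_dsm_destroy idea floated above (which does not exist), and real DSM bookkeeping is more involved. The model tracks a segment's own attach count and pin flag alongside a separate shadow refcount maintained by detach hooks, and records which scheme would have cleaned up the files.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToySegment
{
    int         attach_count;   /* backends currently attached */
    bool        pinned;         /* segment should outlive its last detach */
    bool        destroyed;
    int         shadow_refcount;        /* separate count kept by detach hooks */
    bool        cleaned_by_detach_hook; /* shadow refcount reached zero */
    bool        cleaned_by_destroy_hook;        /* hypothetical destroy hook ran */
} ToySegment;

/* Destroy the segment only when nobody is attached and it is not pinned. */
static void
toy_maybe_destroy(ToySegment *seg)
{
    if (seg->attach_count == 0 && !seg->pinned && !seg->destroyed)
    {
        seg->destroyed = true;
        seg->cleaned_by_destroy_hook = true;
    }
}

/* Create the segment with its creator attached. */
static void
toy_create(ToySegment *seg)
{
    seg->attach_count = 1;
    seg->pinned = false;
    seg->destroyed = false;
    seg->shadow_refcount = 1;
    seg->cleaned_by_detach_hook = false;
    seg->cleaned_by_destroy_hook = false;
}

static void
toy_attach(ToySegment *seg)
{
    assert(!seg->destroyed);
    seg->attach_count++;
    seg->shadow_refcount++;
}

static void
toy_detach(ToySegment *seg)
{
    assert(seg->attach_count > 0);
    seg->attach_count--;
    /* Detach-hook scheme: clean up as soon as the shadow count hits zero. */
    if (--seg->shadow_refcount == 0)
        seg->cleaned_by_detach_hook = true;
    toy_maybe_destroy(seg);
}

static void
toy_pin(ToySegment *seg)
{
    seg->pinned = true;
}

static void
toy_unpin(ToySegment *seg)
{
    seg->pinned = false;
    toy_maybe_destroy(seg);
}
```

In this model, pinning the segment and then detaching the last backend trips the detach-hook cleanup while the segment is still alive, whereas the destroy-hook cleanup waits until the segment is both unpinned and unattached.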

Hmm.

--
Thomas Munro
http://www.enterprisedb.com
