Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-09 22:31:29
Message-ID: CA+TgmoZoycMgCaqq4E4TbHuh74+wtFX+ba2vPGRcop___VSk+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> I agree that the pinned segment case doesn't matter right now, I just
> wanted to point it out. I like your $10 wrench analogy, but maybe it
> could be argued that adding a dsm_on_destroy() callback mechanism is
> not only better than adding another refcount to track that other
> refcount, but also a steal at only $8.

If it's that simple, it might be worth doing, but I bet it's not. One
problem is that there's a race condition: there will inevitably be a
period of time after you've called dsm_attach() and before you've
attached to the specific data structure that we're talking about here.
So suppose the last guy who actually knows about this data structure
dies horribly and doesn't clean up because the DSM isn't being
destroyed; moments later, you die horribly before reaching the code
where you attach to this data structure. Oops.

You might think about plugging that hole by moving the registry of
on-destroy functions into the segment itself and making it a shared
resource. But ASLR breaks that, especially for loadable modules. You
could try to fix that problem, in turn, by storing arguments that can
later be passed to load_external_function() instead of a function
pointer per se. But that sounds pretty fragile because some other
backend might not try to load the module until after it's attached the
DSM segment and it might then fail because loading the module runs
_PG_init() which can throw errors. Maybe you can think of a way to
plug that hole too but you're waaaaay over your $8 budget by this
point.

>>> Secondly, I might not want to be constrained by a
>>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>>> cases where I need to shared a number of batch files that is unknown
>>> at the start of execution time when the DSM segment is sized (I'll
>>> write about that shortly on the Parallel Shared Hash thread). Maybe I
>>> can find a way to get rid of that requirement. Or maybe it could
>>> support DSA memory too, but I don't think it's possible to use
>>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>>> by the time any given DSM segment's detach hook runs, there's no
>>> telling which other DSM segments have been detached already, so the
>>> DSA area may already have partially vanished; some other kind of hook
>>> that runs earlier would be needed...
>>
>> Again, wrench.
>
> My problem here is that I don't know how many batches I'll finish up
> creating. In general that's OK because I can hold onto them as
> private BufFiles owned by participants with the existing cleanup
> mechanism, and then share them just before they need to be shared (ie
> when we switch to processing the next batch so they need to be
> readable by all). Now I only ever share one inner and one outer batch
> file per participant at a time, and then I explicitly delete them at a
> time that I know to be safe and before I need to share a new file that
> would involve recycling the slot, and I'm relying on DSM segment scope
> cleanup only to handle error paths. That means that in generally I
> only need space for 2 * P shared BufFiles at a time. But there is a
> problem case: when the leader needs to exit early, it needs to be able
> to transfer ownership of any files it has created, which could be more
> than we planned for, and then not participate any further in the hash
> join, so it can't participate in the on-demand sharing scheme.

I thought the idea was that the structure we're talking about here
owns all the files, up to 2 from a leader that wandered off plus up to
2 for each worker. Last process standing removes them. Or are you
saying each worker only needs 2 files but the leader needs a
potentially unbounded number?

> Perhaps we can find a way to describe a variable number of BufFiles
> (ie batches) in a fixed space by making sure the filenames are
> constructed in a way that lets us just have to say how many there are.

That could be done.

> Then the next problem is that for each BufFile we have to know how
> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
> ...), which Peter's code currently captures by publishing the file
> size in the descriptor... but if a fixed size object must describe N
> BufFiles, where can I put the size of each one? Maybe I could put it
> in a header of the file itself (yuck!), or maybe I could decide that I
> don't care what the size is, I'll simply unlink "foo", then "foo.1",
> then "foo.2", ... until I get ENOENT.

There's nothing wrong with that algorithm as far as I'm concerned.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2017-02-09 22:32:07 Re: amcheck (B-Tree integrity checking tool)
Previous Message Robert Haas 2017-02-09 22:12:58 Re: amcheck (B-Tree integrity checking tool)