Quick Links

Re: Parallel tuplesort (for parallel B-Tree index creation)

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject:	Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:	2016-12-20 22:53:08
Message-ID:	CA+TgmoYP0vzPw64DfMQT1JHY6SzyAvjogLkj3erMZzzN2f9xLA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I find this unification business really complicated. I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I just looked at the buffile.c/buffile.h changes in the latest version
of the patch and I agree with this criticism, though maybe not with
the proposed solution. I actually don't understand what "unification"
is supposed to mean. The patch really doesn't explain that anywhere
that I can see. It says stuff like:

+ * Parallel operations can use an interface to unify multiple worker-owned
+ * BufFiles and a leader-owned BufFile within a leader process. This relies
+ * on various fd.c conventions about the naming of temporary files.

That comment tells you that unification is a thing you can do -- via
an unspecified interface for unspecified reasons using unspecified
conventions -- but it doesn't tell you what the semantics of it are
supposed to be. For example, if we "unify" several BufFiles, do they
then have a shared seek pointer? Do the existing contents effectively
get concatenated in an unpredictable order, or are they all expected
to be empty at the time unification happens? Or something else? It's
fine to make up new words -- indeed, in some sense that is the essence
of writing any complex problem -- but you have to define them. As far
as I can tell, the idea is that we're somehow magically concatenating
the BufFiles into one big super-BufFile, but I'm fuzzy on exactly
what's supposed to be going on there.

It's hard to understand how something like this doesn't leak
resources. Maybe that's been thought about here, but it isn't very
clear to me how it's supposed to work. In Heikki's proposal, if
process A is trying to read a file owned by process B, and process B
dies and removes the file before process A gets around to reading it,
we have got trouble, especially on Windows which apparently has low
tolerance for such things. Peter's proposal avoids that - I *think* -
by making the leader responsible for all resource cleanup, but that's
inferior to the design we've used for other sorts of shared resource
cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
detach always takes responsibility. That avoids assuming that we're
always dealing with a leader-follower situation, it doesn't
categorically require the leader to be the one who creates the shared
resource, and it doesn't require the leader to be the last process to
die.

Imagine a data structure that is stored in dynamic shared memory and
contains space for a filename, a reference count, and a mutex. Let's
call this thing a SharedTemporaryFile or something like that. It
offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporary File *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporyFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your shared
DSM sgement, you call SharedTemporaryFileInitialize() to initialize
them. Then, every process that cares about the file does
SharedTemporaryFileAttach(), which bumps the reference count and sets
an on_dsm_detach hook to decrement the reference count and unlink the
file if the reference count thereby reaches 0. One of those processes
does SharedTemporaryFileAssign(), which fills in the pathname and
clears FD_TEMPORARY. Then, any process that has attached can call
SharedTemporaryFileGetFile() to get a File which can then be accessed
normally. So, the pattern for parallel sort would be:

- Leader sets aside space and calls SharedTemporaryFileInitialize()
and SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then
SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the path name is filled
in, there's no window where the file is at risk of being leaked.
Before SharedTemporaryFileAssign(), the worker is solely responsible
for removing the file; after that call, whichever of the leader and
the worker exits last will remove the file.

> That leaves one problem, though: reusing space in the final merge phase. If
> the tapes being merged belong to different LogicalTapeSets, and create one
> new tape to hold the result, the new tape cannot easily reuse the space of
> the input tapes because they are on different tape sets.

If the worker is always completely finished with the tape before the
leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
the tape and overwrite it like any other?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Parallel tuplesort (for parallel B-Tree index creation) at 2016-09-21 16:52:00 from Heikki Linnakangas

Responses

Re: Parallel tuplesort (for parallel B-Tree index creation) at 2016-12-21 01:14:36 from Peter Geoghegan
Re: Parallel tuplesort (for parallel B-Tree index creation) at 2016-12-21 12:04:07 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2016-12-20 22:53:18	Re: Make pg_basebackup -x stream the default
Previous Message	Peter Eisentraut	2016-12-20 21:37:04	Re: Clarifying "server starting" messaging in pg_ctl start without --wait