Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-12-21 01:14:36
Message-ID: CAM3SWZRO5He5y-h1u8ozLgZb=9srOA0f169qYbKYKrhvoWzxQQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> What I have in mind is something like the attached patch. It refactors
>> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
>> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
>> doesn't have the concept of a tape number anymore, it can contain any number
>> of tapes, and you can create more on the fly. With that, it'd be fairly easy
>> to make tuplesort.c merge LogicalTapes that came from different tape sets,
>> backed by different BufFiles. I think that'd avoid much of the unification
>> code.
>
> I just looked at the buffile.c/buffile.h changes in the latest version
> of the patch and I agree with this criticism, though maybe not with
> the proposed solution. I actually don't understand what "unification"
> is supposed to mean. The patch really doesn't explain that anywhere
> that I can see. It says stuff like:
>
> + * Parallel operations can use an interface to unify multiple worker-owned
> + * BufFiles and a leader-owned BufFile within a leader process. This relies
> + * on various fd.c conventions about the naming of temporary files.

Without meaning to sound glib, unification is the process by which
parallel CREATE INDEX has the leader read temp files from workers
sufficient to complete its final on-the-fly merge. So, it's a
term that's a bit like "speculative insertion" was up until
UPSERT was committed: a concept that is somewhat in flux, and
describes a new low-level mechanism built to support a higher level
operation, which must accord with a higher level set of requirements
(so, for speculative insertion, that would be avoiding "unprincipled
deadlocks" and so on). That being the case, maybe "unification" isn't
useful as a precise piece of terminology at this point, but that will
change.

While I'm fairly confident that I basically have the right idea with
this patch, I think that you are better at judging the ins and outs of
resource management than I am, not least because of your experience
working on parallel query itself. Also, I'm signed up to review
parallel hash join in large part because I think there might be some
convergence concerning the sharing of BufFiles among parallel workers.
I don't think I'm qualified to judge what a general abstraction like
this should look like, but I'm trying to get there.

> That comment tells you that unification is a thing you can do -- via
> an unspecified interface for unspecified reasons using unspecified
> conventions -- but it doesn't tell you what the semantics of it are
> supposed to be. For example, if we "unify" several BufFiles, do they
> then have a shared seek pointer?

No.

> Do the existing contents effectively
> get concatenated in an unpredictable order, or are they all expected
> to be empty at the time unification happens? Or something else?

The order is the order in which ordinal identifiers are assigned to
workers within tuplesort.c, which is undefined, with the notable
exception of the leader's own identifier (-1) and area of the unified
BufFile space (this is only relevant in randomAccess cases, where the
leader may write stuff out to its own reserved part of the BufFile
space). It only matters that the bit of metadata in shared memory is
in that same order, which it clearly will be. So, it's unpredictable,
but only in the same way that ordinal identifiers are assigned in a
not-well-defined order; it doesn't, or at least shouldn't, matter. We
can imagine a case where it does matter, and we probably should, but
that case isn't parallel CREATE INDEX.
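
To be concrete about what "same order" means here, the shared-memory
metadata could be pictured as something like the following. This is
only an illustrative sketch on my part -- not the actual struct or
field names in the patch:

/*
 * Illustrative sketch only -- not the actual structs or field names
 * from the patch.  Per-worker entries live in an array indexed by the
 * ordinal worker identifier assigned within tuplesort.c.
 */
typedef struct WorkerTapeShared
{
    long        nblocks;    /* blocks this worker's BufFile wrote out */
} WorkerTapeShared;

typedef struct UnifiedBufFileShared
{
    int                 nworkers;
    WorkerTapeShared    workers[FLEXIBLE_ARRAY_MEMBER];
} UnifiedBufFileShared;

The point being that the per-worker entries are just an array indexed
by ordinal identifier, so whatever order tuplesort.c hands identifiers
out in is the order the unified BufFile space ends up in.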

> It's fine to make up new words -- indeed, in some sense that is the essence
> of writing any complex problem -- but you have to define them.

I invite you to help me define this new word.

> It's hard to understand how something like this doesn't leak
> resources. Maybe that's been thought about here, but it isn't very
> clear to me how it's supposed to work.

I agree that it would be useful to centrally document what all this
unification stuff is about. Suggestions on where that should live are
welcome.

> In Heikki's proposal, if
> process A is trying to read a file owned by process B, and process B
> dies and removes the file before process A gets around to reading it,
> we have got trouble, especially on Windows which apparently has low
> tolerance for such things. Peter's proposal avoids that - I *think* -
> by making the leader responsible for all resource cleanup, but that's
> inferior to the design we've used for other sorts of shared resource
> cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
> detach always takes responsibility.

Maybe it's inferior to that, but I think what Heikki proposes is more
or less complementary to what I've proposed, and has nothing to do
with resource management and plenty to do with making the logtape.c
interface look nice, AFAICT. It's also about refactoring/simplifying
logtape.c itself, while we're at it. I believe that Heikki has yet to
comment either way on my approach to resource management, one aspect
of the patch that I was particularly keen on your looking into.

The theory of operation here is that workers own their own BufFiles,
and are responsible for deleting them when they die. The assumption,
rightly or wrongly, is that it's sufficient that workers flush
everything out (write out temp files), and yield control to the
leader, which will open their temp files for the duration of the
leader's final on-the-fly merge. The resource manager in the leader
knows it isn't supposed to ever delete worker-owned files (just
close() the FDs), and the leader errors out if it cannot find temp
files that match what it expects. If there is an error in the leader, it
shuts down workers, and they clean up, more than likely. If there is
an error in the worker, or if the files cannot be deleted (e.g., if
there is a classic hard crash scenario), we should also be okay,
because nobody will trip up on some old temp file from some worker,
since fd.c has some gumption about what workers need to do (and what
the leader needs to avoid) in the event of a hard crash. I don't see a
risk of file descriptor leaks, which may or may not have been part of
your concern (please clarify).
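
To make the leader-side half of that concrete, I'm picturing something
along these lines (a sketch only; the function name and details are
made up here for illustration, not lifted from the patch):

/*
 * Sketch only -- the function name and error wording are made up, not
 * the actual buffile.c/fd.c interface in the patch.
 *
 * The leader opens a worker's temp file by name for reading.  Nothing
 * here registers the file for deletion in the leader; closing the File
 * later just releases the FD, and it remains the worker's job (or the
 * post-crash cleanup cycle's job) to remove the file itself.
 */
static File
leader_open_worker_file(char *worker_temp_path)
{
    File        file;

    file = PathNameOpenFile(worker_temp_path, O_RDONLY | PG_BINARY, 0);
    if (file < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not open temporary file \"%s\": %m",
                        worker_temp_path)));

    return file;
}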

> That avoids assuming that we're
> always dealing with a leader-follower situation, it doesn't
> categorically require the leader to be the one who creates the shared
> resource, and it doesn't require the leader to be the last process to
> die.

I have an open mind about that, especially given the fact that I hope
to generalize the unification stuff further, but I am not aware of any
reason why that is strictly necessary.

> Imagine a data structure that is stored in dynamic shared memory and
> contains space for a filename, a reference count, and a mutex. Let's
> call this thing a SharedTemporaryFile or something like that. It
> offers these APIs:
>
> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

I'm a little bit tired right now, and I have yet to look at Thomas'
parallel hash join patch in any detail. I'm interested in what you
have to say here, but I think that I need to learn more about its
requirements in order to have an informed opinion.
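
That said, just to check that I'm picturing the same shape you are, I
take it the struct itself would be roughly as follows (only my reading
of your description, with made-up field names):

/* My reading of the proposal -- field names are made up */
typedef struct SharedTemporaryFile
{
    slock_t     mutex;                  /* protects refcnt */
    int         refcnt;                 /* attached backends; unlink at 0 */
    char        pathname[MAXPGPATH];    /* set by ...Assign() */
} SharedTemporaryFile;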

>> That leaves one problem, though: reusing space in the final merge phase. If
>> the tapes being merged belong to different LogicalTapeSets, and create one
>> new tape to hold the result, the new tape cannot easily reuse the space of
>> the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

I'll remind you that parallel CREATE INDEX never actually needs to be
randomAccess, and so we are not going to need to do this as things
stand. I wrote the code that way in order to not break the existing
interface, since breaking it seemed like a blocker to posting the
patch. I am open to the idea of such an "adoption" occurring, even
though it wouldn't help any case that exists in the patch as proposed.
I didn't go that far in part because it seemed premature, given that
nobody had yet looked at my work, given that there'd be no initial
user-visible benefit, and given that the exact meaning of
"unification" was (and is) somewhat in flux.

I see no good reason to not do that, although that might change if I
actually seriously undertook to teach the leader about this kind of
"adoption". I suspect that the interface specification would make for
confusing reading, which isn't terribly appealing, but I'm sure I
could manage to make it work given time.

--
Peter Geoghegan
