Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-12 01:58:18
Message-ID: CAH2-WznJ4KafvbK0NT8dmHNLax02yOFAVUmHReQdQ-qxRWB75A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 11, 2018 at 2:26 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> While the patch contains, as I said before, an excellent set of how-to
> directions explaining how to use the new parallel sort facilities in
> tuplesort.c, there seems to be no such thing for logtape.c, and as a
> result I find it a bit unclear how the interface is supposed to work.
> I think it would be good to add a similar summary here.

Okay. I came up with something for that.

> It seems like the words "leader" and "worker" here refer to the leader
> of a parallel operation and the associated workers, but do we really
> need to make that assumption? Couldn't we generally describe this as
> merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet
> into a single LogicalTapeSet that can thereafter be read by the
> process that does the merging?

It's not so much an assumption as it is the most direct way of
referring to these various objects. logtape.c is very clearly a
submodule of tuplesort.c, so this felt okay to me. There are already
several references to what tuplesort.c expects. I'm not going to argue
about it if you insist on this, though I do think that trying to
describe things in more general terms would be a net loss. It would
kind of come off as feigning ignorance IMV. There is nothing that
logtape.c could know less about other than names/roles, and I find it
hard to imagine those changing, even when we add support for
partitioning/distribution sort (where logtape.c handles
"redistribution", something discussed early in this project's
lifetime).

> + /* Pass worker BufFile pieces, and a placeholder leader piece */
> + for (i = 0; i < lts->nTapes; i++)
> + {
> + lt = &lts->tapes[i];
> +
> + /*
> + * Build concatenated view of all BufFiles, remembering the block
> + * number where each source file begins.
> + */
> + if (i < lts->nTapes - 1)
>
> Unless I'm missing something, the "if" condition just causes the last
> pass through this loop to do nothing. If so, why not just change the
> loop condition to i < lts->nTapes - 1 and drop the "if" statement
> altogether?

The last "lt" in the loop is in fact used separately, just outside the
loop. But that use turns out to have been subtly wrong, apparently due
to a problem with converting logtape.c to use the shared buffile
stuff. This buglet would only have caused writing to the leader tape
to break (never trace_sort instrumentation), something that isn't
supported anyway due to the restrictions that shared BufFiles have.
But, we should, on general principle, be able to write to the leader
tape if and when shared buffiles learn to support writing (after
exporting original BufFile in worker).

Buglet fixed in my local working copy. I did so in a way that changes
loop test along the lines you suggest. This should make the whole
design of tape concatenation a bit clearer.

> + char filename[MAXPGPATH] = {0};
>
> I don't think you need = {0}, because pg_itoa is about to clobber it anyway.

Okay.

> + /* Alter worker's tape state (generic values okay for leader) */
>
> What do you mean by generic values?

I mean that the leader's tape doesn't need to have
lt->firstBlockNumber set, because it's empty -- it can remain -1. Same
applies to lt->offsetBlockNumber, too.

I'll remove the text within parenthesis, since it seems redundant
given the structure of the loop.

> + * Each tape is initialized in write state. Serial callers pass ntapes, but
> + * NULL arguments for everything else. Parallel worker callers pass a
> + * shared handle and worker number, but tapeset should be NULL. Leader
> + * passes worker -1, a shared handle, and shared tape metadata. These are
> + * used to claim ownership of worker tapes.
>
> This comment doesn't match the actual function definition terribly
> well. Serial callers don't pass NULL for "everything else", because
> "int worker" is not going to be NULL. For parallel workers, it's not
> entirely obvious whether "a shared handle" means TapeShare *tapes or
> SharedFileSet *fileset. "tapeset" sounds like an argument name, but
> there is no such argument.

Okay. I've tweaked things here.

> lt->max_size looks like it might be an optimization separate from the
> overall patch, but maybe I'm wrong about that.

I think that it's pretty much essential. Currently, the MaxAllocSize
restriction is needed in logtape.c for the same reason that it's
needed anywhere else. Not much to talk about there. The new max_size
thing is about more than that, though -- it's really about not
stupidly allocating up to a full MaxAllocSize when you already know
that you're going to use next to no memory.

You don't have this issue with serial sorts because serial sorts that
only sort a tiny number of tuples never end up as external sorts --
when you end up doing a serial external sort, clearly you're never
going to allocate an excessive amount of memory up front in logtape.c,
because you are by definition operating in a memory constrained
fashion. Not so for parallel external tuplesorts. Think spool2 in a
parallel unique index build, in the case where there are next to no
recently dead tuples (the common case).

> + /* palloc() larger than MaxAllocSize would fail */
> lt->buffer = NULL;
> lt->buffer_size = 0;
> + lt->max_size = MaxAllocSize;
>
> The comment about palloc() should move down to where you assign max_size.

Okay.

> Generally we avoid returning a struct type, so maybe
> LogicalTapeFreeze() should instead grow an out parameter of type
> TapeShare * which it populates only if not NULL.

Okay. I've modified LogicalTapeFreeze(), adding a "share" output
argument and reverting to returning void, as before.

> Won't LogicalTapeFreeze() fail an assertion in BufFileExportShared()
> if the file doesn't belong to a shared fileset? If you adopt the
> previous suggestion, we can probably just make whether to call this
> contingent on whether the TapeShare * out parameter is provided.

Oops, you're right. It will be taken care of by the
LogicalTapeFreeze() function change signature change you suggested.

> I'm not confident I completely understand what's going on with the
> logtape stuff yet, so I might have more comments (or better ones)
> after I study this further. To your question about whether to go
> ahead and post a new version, I'm OK to keep reviewing this version
> for a little longer or to switch to a new one, as you prefer. I have
> not made any local changes, just written a blizzard of email text.
> :-p

Great. Thanks.

I've caught up with you again. I just need to take a look at what I
came up with with fresh eyes, and maybe do some more testing.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-01-12 02:01:59 Re: [HACKERS] Cache lookup errors with functions manipulation object addresses
Previous Message Thomas Munro 2018-01-12 01:58:12 Re: Add default role 'pg_access_server_files'