Quick Links

Re: Parallel tuplesort (for parallel B-Tree index creation)

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject:	Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:	2017-03-22 16:26:00
Message-ID:	CAH2-WznT9oX4AykOwP6Booy_9Si4xGJ4hZOjvihct1n8igo39g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Mar 22, 2017 at 5:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Actually, that's quite possible with the design I came up with.
>
> I don't think it is. What sequence of calls do the APIs you've
> proposed would accomplish that goal? I don't see anything in this
> patch set that would permit anything other than a handoff from the
> worker to the leader. There seems to be no way for the ref count to
> be more than 1 (or 2?).

See my remarks on this below.

>> The
>> restriction that Thomas can't live with as I've left things is that
>> you have to know the number of BufFiles ahead of time. I'm pretty sure
>> that that's all it is. (I do sympathize with the fact that that isn't
>> very helpful to him, though.)
>
> I feel like there's some cognitive dissonance here. On the one hand,
> you're saying we should use your design.

No, I'm not. I'm saying that my design is complete on its own terms,
and has some important properties that a mechanism like this ought to
have. I think I've been pretty clear on my general uncertainty about
the broader question.

> On the other hand, you are
> admitting that in at least one key respect, it won't meet Thomas's
> requirements. On the third hand, you just said that you weren't
> arguing for two mechanisms for sharing a BufFile across cooperating
> parallel processes. I don't see how you can hold all three of those
> positions simultaneously.

I respect your position as the person that completely owns parallelism
here. You are correct when you say that there has to be some overlap
between the requirements for the mechanisms used by each patch --
there just *has* to be. As I said, I only know very approximately how
much overlap that is or should be, even at this late date, and I am
unfortunately not in a position to spend more time on it to find out.
C'est la vie.

I know that I have no chance of convincing you to adopt my design
here, and you are right not to accept the design, because there is a
bigger picture. And, because it's just too late now. My efforts to get
ahead of that, and anticipate and provide for Thomas' requirements
have failed. I admit that. But, you are asserting that my patch has
specific technical defects that it does not have.

I structured things this way for a reason. You are not required to
agree with me in full to see that I might have had a point. I've
described it as a trade-off already. I think that it will be of
practical value to you to see that trade-off. This insight is what
allowed me to immediately zero in on resource leak bugs in Thomas'
revision of the patch from yesterday.

>> It is true that a worker resowner can unlink() the files
>> mid-unification, in the same manner as with conventional temp files,
>> and not decrement its refcount in shared memory, or care at all in any
>> special way. This is okay because the leader (in the case of parallel
>> tuplesort) will realize that it should not "turn out the lights",
>> finding that remaining reference when it calls BufFileClose() in
>> registered callback, as it alone must. It doesn't matter that the
>> unlink() may have already occurred, or may be just about to occur,
>> because we are only operating on already-opened files, and never on
>> the link itself (we don't have to stat() the file link for example,
>> which is naturally only a task for the unlink()'ing backend anyway).
>> You might say that the worker only blows away the link itself, not the
>> file proper, since it may still be open in leader (say).
>
> Well, that sounds like it's counting on fd.c not to close the file
> descriptor at an inconvenient point in time and reopen it later, which
> is not guaranteed.

It's true that in an error path, if the FD of the file we just opened
gets swapped out, that could happen. That seems virtually impossible,
and in any case the consequence is no worse than a confusing LOG
message. But, yes, that's a weakness.

>> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
>> since only the worker itself can reliably know how many segments it
>> has opened at every single instant in time. Because it's the owner!
>
> Above, you said that your design would allow for a group of processes
> to share access to a file, with the last one that abandons it "turning
> out the lights". But here, you are referring to it as having one
> owner - the "only the worker itself" can know the number of segments.
> Those things are exact opposites of each other.

You misunderstood.

Under your analogy, the worker needs to wait for someone else to enter
the room before leaving, because otherwise, as an "environmentally
conscious" worker, it would be compelled to turn the lights out before
anyone else ever got to do anything with its files. But once someone
else is in the room, the worker is free to leave without turning out
the lights. I could provide a mechanism for the leader, or whatever
the other backend is, to do another hand off. You're right that that
is left unimplemented, but it would be a trivial adjunct to what I
came up with.

> I don't think there's any problem with ignoring ENOENT, and I don't
> think there's any need for a process to know the exact number of
> segments in some temporary file.

You may well be right, but that is just one detail.

>> My patch demonstrably has these properties. I've done quite a bit of
>> fault injection testing to prove it.
>
> I don't understand this comment, because 0 of the 3 properties that I
> just articulated are things which can be proved or disproved by fault
> injection. Fault injection can confirm the presence of bugs or
> suggest their absence, but none of those properties have to do with
> whether there are bugs.

I was unclear -- I just meant (3). Specifically, that resource
ownership has been shown to do be robust under stress testing/fault
injection testing.

Anyway, I will provide some feedback on Thomas' latest revision of
today, before I bow out. I owe him at least that much.

--
Peter Geoghegan

In response to

Re: Parallel tuplesort (for parallel B-Tree index creation) at 2017-03-22 12:44:05 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2017-03-22 16:39:13	Re: increasing the default WAL segment size
Previous Message	Robert Haas	2017-03-22 16:24:32	Re: exposing wait events for non-backends (was: Tracking wait event for latches)