Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-07 04:43:35
Message-ID: CAH2-WzmJefSkr7HuTBFkfCmafRbQdwOojVS18Uo5MS=VGZoSEw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 30, 2017 at 9:15 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> IIUC worker_wait() is only being used to keep the worker around so its
>> files aren't deleted. Once buffile cleanup is changed to be
>> ref-counted (in an on_dsm_detach hook?) then workers might as well
>> exit sooner, freeing up a worker slot... do I have that right?
>
> Yes. Or at least I think it's very likely that that will end up happening.

I've looked into this, and have a version of the patch where clean-up
occurs when the last backend with a reference to the BufFile goes
away. It seems robust; all of my private tests pass, which includes
things that parallel CREATE INDEX won't use, and yet is added as
infrastructure (e.g., randomAccess recycling of blocks by leader from
workers).

As Thomas anticipated, worker_wait() now only makes workers wait until
the leader comes along to take a reference to their files, at which
point the worker processes can go away. In effect, the worker
processes go away as soon as possible, just as the leader begins its
final on-the-fly merge. At that point, they could be reused by some
other process, of course.

However, there are some specific implementation issues with this that
I didn't quite anticipate. I would like to get feedback on these
issues now, from both Thomas and Robert. The issues relate to how much
the patch can or should "buy into resource management". You might
guess that this new resource management code is something that should
live in fd.c, alongside the guts of temp file resource management,
within the function FileClose(). That way, it would be called by every
possible path that might delete a temp file, including
ResourceOwnerReleaseInternal(). That's not what I've done, though.
Instead, refcount management is limited to a few higher level routines
in buffile.c. Initially, resource management in FileClose() is made to
assume that it must delete the file. Then, if and when directed to by
BufFileClose()/refcount, a backend may determine that it is not its
job to do the deletion -- it will not be the one that must "turn out
the lights", and so indicates to FileClose() that it should not delete
the file after all (it should just release vFDs, close(), and so on).
Otherwise, when refcount reaches zero, temp files are deleted by
FileClose() in more or less the conventional manner.

The fact that there could, in general, be any error that causes us to
attempt a double-deletion (deletion of a temp file from more than one
backend) for a time is less of a problem than you might think. This is
because there is a risk of this only for as long as two backends hold
open the file at the same time. In the case of parallel CREATE INDEX,
this is now the shortest possible period of time, since workers close
their files using BufFileClose() immediately after the leader wakes
them up from a quiescent state. And, if that were to actually happen,
say due to some random OOM error during that small window, the
consequence is no worse than an annoying log message: "could not
unlink file..." (this would come from the second backend that
attempted an unlink()). You would not see this when a worker raised an
error due to a duplicate violation, or any other routine problem, so
it should really be almost impossible.

That having been said, this probably *is* a problematic restriction in
cases where a temp file's ownership is not immediately handed over
without concurrent sharing. What happens to be a small window for the
parallel CREATE INDEX patch probably wouldn't be a small window for
parallel hash join. :-(

It's not hard to see why I would like to do things this way. Just look
at ResourceOwnerReleaseInternal(). Any release of a file happens
during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic
shared memory segments happens earlier, during
RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to
implement a refcount is using dynamic shared memory, and that seems
hard. There are additional reasons why I suggest we go this way, such
as the fact that all the relevant state belongs to BufFile, which is
implemented a layer above all of the guts of resource management of
temp files within fd.c. I'd have to replicate almost all state in fd.c
to make it all work, which seems like a big modularity violation.

Does anyone have any suggestions on how to tackle this?

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rafia Sabih 2017-02-07 04:49:24 Re: Passing query string to workers
Previous Message Amit Langote 2017-02-07 04:28:42 Re: asynchronous execution