Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-10 00:10:14
Message-ID: CAH2-WzmWtorLU0qi63dTgNbBJPds1wRLDtoZSDRwkRWdvBnMww@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 9, 2017 at 2:31 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource. But ASLR breaks that, especially for loadable modules. You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se. But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors. Maybe you can think of a way to
> plug that hole too but you're waaaaay over your $8 budget by this
> point.

At the risk of stating the obvious, ISTM that the right way to do
this, at a high level, is to err on the side of unneeded extra
unlink() calls rather than leaking files, and to make the window for
problems (the "remaining hole that you haven't quite managed to plug")
practically indistinguishable from no hole at all, in a way that's
baked into the API.

It's not like we currently throw an error when there is a problem with
deleting temp files that are no longer needed on resource manager
cleanup. We simply log the fact that it happened, and limp on.

I attach my V8. This does not yet do anything with on_dsm_detach().
I've run out of time to work on it this week, and am starting a new
job next week at VMware, which I'll need time to settle into. So I'm
posting this now, since you can still very much see the direction I'm
going in, and can give me any feedback that you have. If anyone wants
to show me how it's done by building on this, and finishing off what
I have, be my guest. The new stuff probably isn't quite as polished as I
would prefer, but time grows short, so I won't withhold it.

Changes:

* Implements the refcount mechanism, albeit in a way that leaves a
small window for double unlink() calls if there is an error during the
brief period of worker/leader co-ownership of a BufFile (just add an
"elog(ERROR)" just before leader-as-worker Tuplesort state is
ended within _bt_leafbuild() to see what I mean). This implies that
background workers can be reclaimed once the leader needs to start its
final on-the-fly merge, which is nice. As an example of how that's
nice, this change makes maintenance_work_mem a budget that we more
strictly adhere to.

* Fixes bitrot caused by recent logtape.c bugfix in master branch.

* No local segment is created during unification unless and until one
is required. (In practice, for current use of BufFile infrastructure,
no "local" segment is ever created, even if we force a randomAccess
case using one of the testing GUCs from 0002-* -- we'd have to use
another GUC to *also* force there to be no reclamation.)

* Better testing. As I just mentioned, we can now force logtape.c to
not reclaim blocks, so you can make new local segments as part of a
unified BufFile, which have different considerations from a resource
management point of view. Despite being part of the same "unified"
BufFile from the leader's perspective, it behaves like a local
segment, so it definitely seems like a good idea to have test coverage
for this, at least during development. (I have a pretty rough test
suite that I'm using; development of this patch has been somewhat test
driven.)

* Better encapsulation of BufFile stuff. I am even closer to the ideal
of this whole sharing mechanism being a fairly generic BufFile thing
that logtape.c piggy-backs on without having special knowledge of the
mechanism. It's still true that the mechanism (sharing/unification) is
written principally with logtape.c in mind, but that's just because of
its performance characteristics. Nothing to do with the interface.

* Worked through items raised by Thomas in his 2017-01-30 mail to this thread.
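
The refcount handoff described in the first change above can be
sketched roughly as follows. This is an illustration only, with
hypothetical names; the real patch keeps the state in shared memory
under proper locking, which plain ints here stand in for.

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of worker/leader co-ownership of a shared
 * BufFile: the worker creates it with refcount 1, the leader attaches
 * (refcount 2), and the underlying files are unlinked only when the
 * last owner detaches.  This is what lets workers be reclaimed once
 * the leader starts its final on-the-fly merge.
 */
typedef struct SharedBufFileRef
{
	int		refcount;		/* number of current owners */
	bool	unlinked;		/* have the files been removed? */
} SharedBufFileRef;

static void
sbf_attach(SharedBufFileRef *ref)
{
	ref->refcount++;
}

/* Returns true iff this detach was the last one and files were unlinked. */
static bool
sbf_detach(SharedBufFileRef *ref)
{
	if (--ref->refcount == 0)
	{
		ref->unlinked = true;	/* last owner removes the files */
		return true;
	}
	return false;
}
```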

>>>> Secondly, I might not want to be constrained by a
>>>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>>>> cases where I need to shared a number of batch files that is unknown
>>>> at the start of execution time when the DSM segment is sized (I'll
>>>> write about that shortly on the Parallel Shared Hash thread). Maybe I
>>>> can find a way to get rid of that requirement. Or maybe it could
>>>> support DSA memory too, but I don't think it's possible to use
>>>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>>>> by the time any given DSM segment's detach hook runs, there's no
>>>> telling which other DSM segments have been detached already, so the
>>>> DSA area may already have partially vanished; some other kind of hook
>>>> that runs earlier would be needed...
>>>
>>> Again, wrench.

I like the wrench analogy too, FWIW.

>> My problem here is that I don't know how many batches I'll finish up
>> creating. In general that's OK because I can hold onto them as
>> private BufFiles owned by participants with the existing cleanup
>> mechanism, and then share them just before they need to be shared (ie
>> when we switch to processing the next batch so they need to be
>> readable by all). Now I only ever share one inner and one outer batch
>> file per participant at a time, and then I explicitly delete them at a
>> time that I know to be safe and before I need to share a new file that
>> would involve recycling the slot, and I'm relying on DSM segment scope
>> cleanup only to handle error paths. That means that in general I
>> only need space for 2 * P shared BufFiles at a time. But there is a
>> problem case: when the leader needs to exit early, it needs to be able
>> to transfer ownership of any files it has created, which could be more
>> than we planned for, and then not participate any further in the hash
>> join, so it can't participate in the on-demand sharing scheme.

I think that parallel CREATE INDEX can easily live with the
restriction that we need to know how many shared BufFiles are needed
up front. It will either be 1, or 2 (when there are 2 nbtsort.c
spools, for unique index builds). We can also detect when the limit is
already exceeded early, and back out, just as we do when there are no
parallel workers currently available.

>> Then the next problem is that for each BufFile we have to know how
>> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
>> ...), which Peter's code currently captures by publishing the file
>> size in the descriptor... but if a fixed size object must describe N
>> BufFiles, where can I put the size of each one? Maybe I could put it
>> in a header of the file itself (yuck!), or maybe I could decide that I
>> don't care what the size is, I'll simply unlink "foo", then "foo.1",
>> then "foo.2", ... until I get ENOENT.
>
> There's nothing wrong with that algorithm as far as I'm concerned.
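
For what it's worth, that algorithm is simple to express. A minimal
sketch (the helper name is hypothetical; segment naming follows the
foo, foo.1, foo.2, ... convention Thomas describes):

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical sketch: unlink a BufFile's 1GB segments without knowing
 * its size up front.  Delete "foo", "foo.1", "foo.2", ... until an
 * unlink() fails with ENOENT, which marks the end of the segment chain.
 * Returns the number of segments removed, or -1 on unexpected failure.
 */
static int
unlink_buffile_segments(const char *base)
{
	char	path[1024];
	int		seg = 0;
	int		removed = 0;

	for (;;)
	{
		if (seg == 0)
			snprintf(path, sizeof(path), "%s", base);
		else
			snprintf(path, sizeof(path), "%s.%d", base, seg);

		if (unlink(path) != 0)
		{
			if (errno == ENOENT)
				break;			/* no more segments: done */
			return -1;			/* unexpected failure */
		}
		removed++;
		seg++;
	}
	return removed;
}
```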

I would like to point out, just to be completely clear, that while
this V8 doesn't "do refcounts properly" (it doesn't use an
on_dsm_detach() hook and so on), the only benefit that doing so would
actually have for parallel CREATE INDEX is that it makes it impossible
that the user could see a spurious ENOENT related log message during
unlink() (I err on the side of doing too much unlinking, not too
little). Which is very unlikely anyway. So, if that's okay for
parallel hash join, as indicated by Robert here, an issue like that
would presumably also be okay for parallel CREATE INDEX. It then
follows that what I'm missing here is something that is only really
needed for the parallel hash join patch anyway.

I really want to help Thomas, and am not shirking what I feel is a
responsibility to assist him. I have every intention of breaking this
down to produce a usable patch that only has the BufFile + resource
management stuff, that follows the interface he sketched as a
requirement for me in his most recent revision of his patch series
("0009-hj-shared-buffile-strawman-v4.patch"). I'm just pointing out
that my patch is reasonably complete as a standalone piece of work
right now, AFAICT.

--
Peter Geoghegan

Attachment Content-Type Size
0001-Add-parallel-B-tree-index-build-sorting.patch.gz application/x-gzip 61.4 KB
0002-Add-temporary-testing-tools.patch.gz application/x-gzip 5.4 KB
