Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-03-12 22:05:40
Message-ID: CAH2-Wznuf2nwsr6=4n8aRQtqCu3PaFYuvDEoYeEenYK-bXYX4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Thu, Feb 16, 2017 at 8:45 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I do not think there should be any reason why we can't get the
>> resource accounting exactly correct here. If a single backend manages
>> to remove every temporary file that it creates exactly once (and
>> that's currently true, modulo system crashes), a group of cooperating
>> backends ought to be able to manage to remove every temporary file
>> that any of them create exactly once (again, modulo system crashes).
>
> I believe that we are fully in agreement here. In particular, I think
> it's bad that there is an API that says "caller shouldn't throw an
> elog error between these two points", and that will be fixed before
> too long. I just think that it's worth acknowledging a certain nuance.

I attach my V9 of the patch. I came up some stuff for the design of
resource management that I think meets every design goal that we have
for shared/unified BufFiles:

* Avoids both resource leaks, and spurious double-freeing of resources
(e.g., a second unlink() for a file from a different process) when
there are errors. The latter problem was possible before, a known
issue with V8 of the patch. I believe that this revision avoids these
problems in a way that is *absolutely bulletproof* in the face of
arbitrary failures (e.g., palloc() failure) in any process at any
time. Although, be warned that there is a remaining open item
concerning resource management in the leader-as-worker case, which I
go into below.

There are now what you might call "critical sections" in one function.
That is, there are points where we cannot throw an error (without a
BEGIN_CRIT_SECTION()!), but those are entirely confined to unification
code within the leader, where we can be completely sure that no error
can be raised. The leader can even fail before some but not all of a
particular worker's segments are in its local resource manager, and we
still do the right thing. I've been testing this by adding code that
randomly throws errors at points interspersed throughout worker and
leader unification hand-off points. I then leave this stress-test
build to run for a few hours, while monitoring for leaked files and
spurious fd.c reports of double-unlink() and similar issues. Test
builds change LOG to PANIC within several places in fd.c, while
MAX_PHYSICAL_FILESIZE was reduced from 1GiB to BLCKSZ.

All of these guarantees are made without any special care from caller
to buffile.c. The only V9 change to tuplesort.c or logtape.c in this
general area is that they have to pass a dynamic shared memory segment
to buffile.c, so that it can register a new callback. That's it. This
may be of particular interest to Thomas. All complexity is confined to
buffile.c.

* No expansion in the use of shared memory to manage resources.
BufFile refcount is still per-worker. The role of local resource
managers is unchanged.

* Additional complexity over and above ordinary BufFile resource
management is confined to the leader process and its on_dsm_detach()
callback. Only the leader registers a callback. Of course, refcount
management within BufFileClose() can still take place in workers, but
that isn't something that we rely on (that's only for non-error
paths). In general, worker processes mostly have resource managers
managing their temp file segments as a thing that has nothing to do
with BufFiles (BufFiles are still not owned by resowner.c/fd.c --
they're blissfully unaware of all of this stuff).

* In general, unified BufFiles can still be treated in exactly the
same way as conventional BufFiles, and things just work, without any
special cases being exercised internally.

There is still an open item here, though: The leader-as-worker
Tuplesortstate, a special case, can still leak files. So,
stress-testing will only show the patch to be completely robust
against resource leaks when nbtsort.c is modified to enable
FORCE_SINGLE_WORKER testing. Despite the name FORCE_SINGLE_WORKER, you
can also modify that file to force there to be arbitrary-many workers
requested (just change "requested = 1" to something else). The
leader-as-worker problem is avoided because we don't have the leader
participating as a worker this way, which would otherwise present
issues for resowner.c that I haven't got around to fixing just yet. It
isn't hard to imagine why this is -- one backend with two FDs for
certain fd.c temp segments is just going to cause problems for
resowner.c without additional special care. Didn't seem worth blocking
on that. I want to prove that my general approach is workable. That
problem is confined to one backend's resource manager when it is the
leader participating as a worker. It is not a refcount problem. The
simplest solution here would be to ban the leader-as-worker case by
contract. Alternatively, we could pass fd.c segments from the
leader-as-worker Tuplesortstate's BufFile to the leader
Tuplesortstate's BufFile without opening or closing anything. This
way, there will be no second vFD entry for any segment at any time.

I've also made several changes to the cost model, changes agreed to
over on the "Cost model for parallel CREATE INDEX" thread. No need for
a recap on what those changes are here. In short, things have been
*significantly* simplified in that area.

Finally, note that I decided to throw out more code within
tuplesort.c. Now, a parallel leader is a thing that is explicitly set
up to be exactly consistent with a conventional/serial external sort
whose merge is about to begin. In particular, it now uses mergeruns().

Robert said that he thinks that this is a patch that is to some degree
a parallelism patch, and to some degree about sorting. I'd say that by
now, it's roughly 5% about sorting, in terms of the proportion of code
that expressly considers sorting. Most of the new stuff in tuplesort.c
is about managing dependencies between participating backends. I've
really focused on avoiding new special cases, especially with V9.

--
Peter Geoghegan

Attachment Content-Type Size
0001-Add-parallel-B-tree-index-build-sorting.patch.gz application/x-gzip 55.6 KB
0002-Add-temporary-testing-tools.patch.gz application/x-gzip 4.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Venkata B Nagothi 2017-03-13 00:06:00 Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Previous Message Pavel Stehule 2017-03-12 21:26:33 Re: possible encoding issues with libxml2 functions