Re: WIP: [[Parallel] Shared] Hash

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-01-11 01:56:50
Message-ID: CAM3SWZRQE+psQuOaU1hVK09PDPrSQ6ydhuv9WbkKnp7DXK+F9Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Here is a new WIP patch. I have plenty of things to tidy up (see note
> at end), but the main ideas are now pretty clear and I'd appreciate
> some feedback.

I have some review feedback for your V3. I've chosen to start with the
buffile.c stuff, since of course it might share something with my
parallel tuplesort patch. This isn't comprehensive; I'll have fuller
feedback soon.

I'm not surprised that you've generally chosen to make shared BufFile
management as simple as possible, with no special infrastructure beyond
the ability for a worker to hold other backends' temp files open
concurrently, and with no writing to another worker's temp file and no
shared read pointer. As you put it, everything is immutable. I
couldn't see much opportunity for adding a lot of infrastructure that
wasn't written explicitly as parallel hash join code/infrastructure.
My sense is that that was a good decision. I doubted that you'd ever
want some advanced, generic shared BufFile thing with multiple read
pointers, built-in cache coherency, etc. (Robert seemed to think that
you'd go that way, though.)

Anyway, some more specific observations:

* ISTM that this is the wrong thing for shared BufFiles:

> +BufFile *
> +BufFileImport(BufFileDescriptor *descriptor)
> +{
...
> +    file->isInterXact = true;    /* prevent cleanup by this backend */

There is only one user of isInterXact = true BufFiles at present,
tuplestore.c. It, in turn, only does so for cases that require
persistent tuple stores. A quick audit of those tuplestore.c callers
shows that this is just the cursor support code within portalmem.c.
Here is the relevant tuplestore_begin_heap() rule that that code
adheres to, unlike the code I've quoted above:

* interXact: if true, the files used for on-disk storage persist beyond the
* end of the current transaction. NOTE: It's the caller's responsibility to
* create such a tuplestore in a memory context and resource owner that will
* also survive transaction boundaries, and to ensure the tuplestore is closed
* when it's no longer wanted.
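
For reference, here is roughly the pattern that that cursor code
follows (paraphrasing portalmem.c's PortalCreateHoldStore() from
memory, so the details may be slightly off): the tuplestore lives in a
per-portal memory context that survives transaction boundaries, and the
portal machinery takes responsibility for closing it when the portal is
dropped:

    /*
     * Paraphrased from portalmem.c: create the holdable store in a
     * context that is NOT bound to the current transaction.
     */
    portal->holdContext =
        AllocSetContextCreate(PortalMemory,
                              "PortalHoldContext",
                              ALLOCSET_DEFAULT_SIZES);

    oldcxt = MemoryContextSwitchTo(portal->holdContext);
    portal->holdStore =
        tuplestore_begin_heap(portal->cursorOptions & CURSOR_OPT_SCROLL,
                              true,     /* interXact */
                              work_mem);
    MemoryContextSwitchTo(oldcxt);

That's the kind of arrangement the quoted rule is talking about.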

I don't think it's right for buffile.c to know anything about file
paths directly -- I'd say that that's a modularity violation.
PathNameOpenFile() is called by very few callers at the moment, all of
them very low level (e.g. md.c), but you're using it within buffile.c
to open a path to the file that you obtain from shared memory
directly. This is buggy because the following code won't be reached in
workers that call your BufFileImport() function:

    /* Mark it for deletion at close */
    VfdCache[file].fdstate |= FD_TEMPORARY;

    /* Register it with the current resource owner */
    if (!interXact)
    {
        VfdCache[file].fdstate |= FD_XACT_TEMPORARY;

        ResourceOwnerEnlargeFiles(CurrentResourceOwner);
        ResourceOwnerRememberFile(CurrentResourceOwner, file);
        VfdCache[file].resowner = CurrentResourceOwner;

        /* ensure cleanup happens at eoxact */
        have_xact_temporary_files = true;
    }

Certainly, you don't want the "Mark it for deletion at close" bit.
Deletion should not happen at eoxact for non-owners-but-sharers
(within FileClose()), but you *do* want CleanupTempFiles() to call
FileClose() for the virtual file descriptors you've opened in the
backend, to do some other cleanup. In general, you want to buy into
resource ownership for workers. As things stand, I think that this
will leak virtual file descriptors. That's well hidden because there
is a similar CleanupTempFiles() call at proc exit, I think. (I didn't
take the time to confirm that that's what masks the problem. Either
way, I'm sure that you want minimal divergence from the serial case,
resource-ownership-wise.)

Instead of all this, I suggest copying some of my changes to fd.c, so
that resource ownership within fd.c differentiates between a vfd that
is owned by the backend in the conventional sense, including having a
need to delete at eoxact, as well as a lesser form of ownership where
deletion should not happen. Maybe you'll end up using my BufFileUnify
interface [1] within workers (instead of just within the leader, as
with parallel tuplesort), and have it handle all of that for you.
Currently, that would mean that there'd be an unused/0 sized "local"
segment for the unified BufFile, but I was thinking of making that not
happen unless and until a new segment is actually needed, so even that
minor wart wouldn't necessarily affect you.
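
To make that concrete, here is a hypothetical sketch of what I mean --
FD_CLOSE_AT_EOXACT is an invented flag name, not something from my
patch: the imported vfd is registered with the resource owner so that
it gets closed at eoxact, but it is never marked for deletion, because
the creating backend remains responsible for unlinking the underlying
file:

    /*
     * Hypothetical sketch only (FD_CLOSE_AT_EOXACT is an invented name).
     * Close at end of xact via the resource owner mechanism, but never
     * unlink here; the creating backend owns the file on disk.
     */
    VfdCache[file].fdstate |= FD_CLOSE_AT_EOXACT;

    ResourceOwnerEnlargeFiles(CurrentResourceOwner);
    ResourceOwnerRememberFile(CurrentResourceOwner, file);
    VfdCache[file].resowner = CurrentResourceOwner;

    /* ensure cleanup happens at eoxact */
    have_xact_temporary_files = true;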

> Some assorted notes on the status: I need to do some thinking about
> the file cleanup logic: both explicit deletes at the earliest possible
> time, and failure/error paths. Currently the creator of each file is
> responsible for cleaning it up, but I guess if the creator aborts
> early the file disappears underneath the others' feet, and then I
> guess they might raise a confusing error report that races against the
> root cause error report; I'm looking into that. Rescans and skew
> buckets not finished yet.

The rescan code path seems to segfault when the regression tests are
run. There is a NULL pointer dereference here:

> @@ -985,6 +1855,14 @@ ExecReScanHashJoin(HashJoinState *node)
>             node->hj_HashTable = NULL;
>             node->hj_JoinState = HJ_BUILD_HASHTABLE;
>
> +           if (HashJoinTableIsShared(node->hj_HashTable))
> +           {
> +               /* Coordinate a rewind to the shared hash table creation phase. */
> +               BarrierWaitSet(&hashNode->shared_table_data->barrier,
> +                              PHJ_PHASE_BEGINNING,
> +                              WAIT_EVENT_HASHJOIN_REWINDING3);
> +           }
> +

Clearly, HashJoinTableIsShared() should not be called when its
argument (in this case node->hj_HashTable) is NULL.
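
Presumably the fix is to latch onto the shared-ness while the pointer
is still valid. Just as an untested illustration of the ordering (I
haven't thought about whether the barrier coordination itself is
otherwise correct here):

    /* Untested sketch: capture the flag before the table goes away */
    bool    hashtable_was_shared = HashJoinTableIsShared(node->hj_HashTable);

    ExecHashTableDestroy(node->hj_HashTable);
    node->hj_HashTable = NULL;
    node->hj_JoinState = HJ_BUILD_HASHTABLE;

    if (hashtable_was_shared)
    {
        /* Coordinate a rewind to the shared hash table creation phase. */
        BarrierWaitSet(&hashNode->shared_table_data->barrier,
                       PHJ_PHASE_BEGINNING,
                       WAIT_EVENT_HASHJOIN_REWINDING3);
    }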

In general, I think you should set expectations up front about what
happens when the regression tests are run, because running them is
usually the first thing reviewers do.

Various compiler warnings on my system:

/home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHash.c:1376:7:
warning: variable ‘size_before_shrink’ set but not used
[-Wunused-but-set-variable]
Size size_before_shrink = 0;
^
...

/home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
In function ‘ExecHashJoinCloseBatch’:
/home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1548:28:
warning: variable ‘participant’ set but not used
[-Wunused-but-set-variable]
HashJoinParticipantState *participant;
^
/home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
In function ‘ExecHashJoinRewindBatches’:
/home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1587:23:
warning: variable ‘batch_reader’ set but not used
[-Wunused-but-set-variable]
HashJoinBatchReader *batch_reader;
^

Is this change really needed?:

> --- a/src/backend/executor/nodeSeqscan.c
> +++ b/src/backend/executor/nodeSeqscan.c
> @@ -31,6 +31,8 @@
> #include "executor/nodeSeqscan.h"
> #include "utils/rel.h"
>
> +#include <unistd.h>
> +
> static void InitScanRelation(SeqScanState *node, EState *estate, int eflags);
> static TupleTableSlot *SeqNext(SeqScanState *node);
>

That's all I have for now...

[1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#buffile.c.2C_and_BufFile_unification
--
Peter Geoghegan
