Re: WIP: [[Parallel] Shared] Hash

From: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-01-12 11:02:40
Message-ID: CAOGQiiOJji9GbLwfT0vBuixdkgAsxzZjTpPxsO6BiTLD--SzYg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 12, 2017 at 9:07 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Wed, Jan 11, 2017 at 2:56 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> > On Fri, Jan 6, 2017 at 12:01 PM, Thomas Munro
> > <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> >> Here is a new WIP patch. I have plenty of things to tidy up (see note
> >> at end), but the main ideas are now pretty clear and I'd appreciate
> >> some feedback.
> >
> > I have some review feedback for your V3. I've chosen to start with the
> > buffile.c stuff, since of course it might share something with my
> > parallel tuplesort patch. This isn't comprehensive, but I will have
> > more comprehensive feedback soon.
>
> Thanks!
>
> > I'm not surprised that you've generally chosen to make shared BufFile
> > management as simple as possible, with no special infrastructure other
> > than the ability to hold open other backend temp files concurrently
> > within a worker, and no writing to another worker's temp file, or
> > shared read pointer. As you put it, everything is immutable. I
> > couldn't see much opportunity for adding a lot of infrastructure that
> > wasn't written explicitly as parallel hash join code/infrastructure.
> > My sense is that that was a good decision. I doubted that you'd ever
> > want some advanced, generic shared BufFile thing with multiple read
> > pointers, built-in cache coherency, etc. (Robert seemed to think that
> > you'd go that way, though.)
>
> Right, this is extremely minimalist infrastructure. fd.c is
> unchanged. buffile.c only gains the power to export/import read-only
> views of BufFiles. There is no 'unification' of BufFiles: each hash
> join participant simply reads from the buffile it wrote, and then
> imports and reads from its peers' BufFiles, until all are exhausted;
> so the 'unification' is happening in caller code which knows about the
> set of participants and manages shared read positions. Clearly there
> are some ownership/cleanup issues to straighten out, but I think those
> problems are fixable (probably involving refcounts).
>
> I'm entirely willing to throw that away and use the unified BufFile
> concept, if it can be extended to support multiple readers of the
> data, where every participant unifies the set of files. I have so far
> assumed that it would be most efficient for each participant to read
> from the file that it wrote before trying to read from files written
> by other participants. I'm reading your patch now; more soon.
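
For my own understanding, the caller-side pattern described above would look
roughly like the sketch below. Only BufFileImport's signature appears in the
posted patch; SharedBatchState, descriptors and participant_number are
invented names.

#include "postgres.h"

#include "storage/buffile.h"

typedef struct SharedBatchState         /* invented for this sketch */
{
    BufFileDescriptor descriptors[FLEXIBLE_ARRAY_MEMBER];
} SharedBatchState;

static void
read_all_batch_files(SharedBatchState *shared, BufFile *my_file,
                     int participant_number, int nparticipants)
{
    int         i;

    for (i = 0; i < nparticipants; i++)
    {
        int         p = (participant_number + i) % nparticipants;
        BufFile    *file;

        if (p == participant_number)
            file = my_file;     /* drain the file this worker wrote first */
        else
            file = BufFileImport(&shared->descriptors[p]);  /* peer's file, read-only */

        if (BufFileSeek(file, 0, 0L, SEEK_SET) != 0)
            elog(ERROR, "could not rewind batch file");

        /* ... BufFileRead() tuples until the file is exhausted ... */
    }
}
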
>
> > Anyway, some more specific observations:
> >
> > * ISTM that this is the wrong thing for shared BufFiles:
> >
> >> +BufFile *
> >> +BufFileImport(BufFileDescriptor *descriptor)
> >> +{
> > ...
> >> + file->isInterXact = true; /* prevent cleanup by this backend */
> >
> > There is only one user of isInterXact = true BufFiles at present,
> > tuplestore.c. It, in turn, only does so for cases that require
> > persistent tuple stores. A quick audit of these tuplestore.c callers
> > show this to just be cursor support code within portalmem.c. Here is
> > the relevant tuplestore_begin_heap() rule that that code adheres to,
> > unlike the code I've quoted above:
> >
> > * interXact: if true, the files used for on-disk storage persist beyond the
> > * end of the current transaction. NOTE: It's the caller's responsibility to
> > * create such a tuplestore in a memory context and resource owner that will
> > * also survive transaction boundaries, and to ensure the tuplestore is closed
> > * when it's no longer wanted.
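
For reference, the one legitimate interXact user follows that rule roughly as
below; this is a paraphrase of the holdStore setup in portalmem.c, not a
verbatim copy.

MemoryContext oldcxt;

/*
 * Paraphrase of portalmem.c's holdStore setup: the store is created in a
 * memory context that survives transaction boundaries, and the portal
 * code takes responsibility for closing it when the cursor goes away.
 */
oldcxt = MemoryContextSwitchTo(portal->holdContext);

portal->holdStore =
    tuplestore_begin_heap(portal->cursorOptions & CURSOR_OPT_SCROLL,
                          true,         /* interXact */
                          work_mem);

MemoryContextSwitchTo(oldcxt);
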
>
> Hmm. Yes, that is an entirely bogus use of isInterXact. I am
> thinking about how to fix that with refcounts.
>
> > I don't think it's right for buffile.c to know anything about file
> > paths directly -- I'd say that that's a modularity violation.
> > PathNameOpenFile() is called by very few callers at the moment, all of
> > them very low level (e.g. md.c), but you're using it within buffile.c
> > to open a path to the file that you obtain from shared memory
>
> Hmm. I'm not seeing the modularity violation. buffile.c uses
> interfaces already exposed by fd.c to do this: OpenTemporaryFile,
> then FilePathName to find the path, then PathNameOpenFile to open from
> another process. I see that your approach instead has client code
> provide more meta data so that things can be discovered, which may
> well be a much better idea.
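
If I follow, the fd.c-level sequence being described is roughly the
following; the descriptor struct and function names are invented for
illustration, while the three fd.c calls are the existing ones.

#include "postgres.h"

#include <fcntl.h>

#include "storage/fd.h"

typedef struct BufFileSegmentDescriptor
{
    char        path[MAXPGPATH];    /* invented: path published via shared memory */
} BufFileSegmentDescriptor;

/* Exporting backend: create a temp file and publish its path. */
static File
export_segment(BufFileSegmentDescriptor *d)
{
    File        f = OpenTemporaryFile(false);

    strlcpy(d->path, FilePathName(f), MAXPGPATH);
    return f;
}

/* Importing backend: open the same file read-only by path. */
static File
import_segment(BufFileSegmentDescriptor *d)
{
    File        f = PathNameOpenFile((FileName) d->path, O_RDONLY | PG_BINARY, 0);

    if (f < 0)
        elog(ERROR, "could not open shared temporary file \"%s\"", d->path);
    return f;
}
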
>
> > directly. This is buggy because the following code won't be reached in
> > workers that call your BufFileImport() function:
> >
> > /* Mark it for deletion at close */
> > VfdCache[file].fdstate |= FD_TEMPORARY;
> >
> > /* Register it with the current resource owner */
> > if (!interXact)
> > {
> > VfdCache[file].fdstate |= FD_XACT_TEMPORARY;
> >
> > ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> > ResourceOwnerRememberFile(CurrentResourceOwner, file);
> > VfdCache[file].resowner = CurrentResourceOwner;
> >
> > /* ensure cleanup happens at eoxact */
> > have_xact_temporary_files = true;
> > }
>
> Right, that is a problem. A refcount mode could fix that; virtual
> file descriptors would be closed in every backend using the current
> resource owner, and the files would be deleted when the last one turns
> out the lights.
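
Something like the shared counter below is what I imagine by a 'refcount
mode'; entirely hypothetical, just using the existing atomics API.

#include "postgres.h"

#include <unistd.h>

#include "port/atomics.h"

typedef struct SharedBufFileRef         /* invented for this sketch */
{
    pg_atomic_uint32 refcount;
    char             path[MAXPGPATH];
} SharedBufFileRef;

static void
shared_buffile_attach(SharedBufFileRef *ref)
{
    pg_atomic_add_fetch_u32(&ref->refcount, 1);
}

static void
shared_buffile_detach(SharedBufFileRef *ref)
{
    /* Each backend closes its own vfd via its resource owner, then... */
    if (pg_atomic_sub_fetch_u32(&ref->refcount, 1) == 0)
    {
        /* Last one out turns off the lights: remove the file on disk. */
        if (unlink(ref->path) != 0)
            elog(LOG, "could not unlink \"%s\": %m", ref->path);
    }
}
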
>
> > Certainly, you don't want the "Mark it for deletion at close" bit.
> > Deletion should not happen at eoxact for non-owners-but-sharers
> > (within FileClose()), but you *do* want CleanupTempFiles() to call
> > FileClose() for the virtual file descriptors you've opened in the
> > backend, to do some other cleanup. In general, you want to buy into
> > resource ownership for workers. As things stand, I think that this
> > will leak virtual file descriptors. That's really well hidden because
> > there is a similar CleanupTempFiles() call at proc exit, I think.
> > (Didn't take the time to make sure that that's what masked problems.
> > I'm sure that you want minimal divergence with serial cases,
> > resource-ownership-wise, in any case.)
> >
> > Instead of all this, I suggest copying some of my changes to fd.c, so
> > that resource ownership within fd.c differentiates between a vfd that
> > is owned by the backend in the conventional sense, including having a
> > need to delete at eoxact, as well as a lesser form of ownership where
> > deletion should not happen. Maybe you'll end up using my BufFileUnify
> > interface [1] within workers (instead of just within the leader, as
> > with parallel tuplesort), and have it handle all of that for you.
> > Currently, that would mean that there'd be an unused/0 sized "local"
> > segment for the unified BufFile, but I was thinking of making that not
> > happen unless and until a new segment is actually needed, so even that
> > minor wart wouldn't necessarily affect you.
>
> Ok, I'm studying that code now.
>
> >> Some assorted notes on the status: I need to do some thinking about
> >> the file cleanup logic: both explicit deletes at the earliest possible
> >> time, and failure/error paths. Currently the creator of each file is
> >> responsible for cleaning it up, but I guess if the creator aborts
> >> early the file disappears underneath the others' feet, and then I
> >> guess they might raise a confusing error report that races against the
> >> root cause error report; I'm looking into that. Rescans and skew
> >> buckets not finished yet.
> >
> > The rescan code path seems to segfault when the regression tests are
> > run. There is a NULL pointer dereference here:
> >
> >> @@ -985,6 +1855,14 @@ ExecReScanHashJoin(HashJoinState *node)
> >> node->hj_HashTable = NULL;
> >> node->hj_JoinState = HJ_BUILD_HASHTABLE;
> >>
> >> + if (HashJoinTableIsShared(node->hj_HashTable))
> >> + {
> >> + /* Coordinate a rewind to the shared hash table creation phase. */
> >> + BarrierWaitSet(&hashNode->shared_table_data->barrier,
> >> + PHJ_PHASE_BEGINNING,
> >> + WAIT_EVENT_HASHJOIN_REWINDING3);
> >> + }
> >> +
> >
> > Clearly, HashJoinTableIsShared() should not be called when its
> > argument (in this case node->hj_HashTable) is NULL.
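
A trivial guard like the sketch below would make the quoted block a no-op
here (hj_HashTable has just been cleared a few lines earlier), which seems
consistent with leaving rescan support for later.

if (node->hj_HashTable != NULL &&
    HashJoinTableIsShared(node->hj_HashTable))
{
    /* Coordinate a rewind to the shared hash table creation phase. */
    BarrierWaitSet(&hashNode->shared_table_data->barrier,
                   PHJ_PHASE_BEGINNING,
                   WAIT_EVENT_HASHJOIN_REWINDING3);
}
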
> >
> > In general, I think you should try to set expectations about what
> > happens when the regression tests run up front, because that's usually
> > the first thing reviewers do.
>
> Apologies, poor form. That block can be commented out for now because
> rescan support is obviously incomplete, and I didn't mean to post it
> that way. Doing so reveals two remaining test failures: "join" and
> "rowsecurity" managed to lose a couple of rows. Oops. I will figure
> out what I broke and have a fix for that in my next version.
>
> > Various compiler warnings on my system:
> >
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHash.c:1376:7:
> > warning: variable ‘size_before_shrink’ set but not used [-Wunused-but-set-variable]
> > Size size_before_shrink = 0;
> > ^
>
> In this case it was only used in dtrace builds; I will make sure any
> such code is compiled out when in non-dtrace builds.
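
One way to do that is the TRACE_POSTGRESQL_*_ENABLED() convention from the
generated probes.h, which evaluates to 0 in non-dtrace builds so the compiler
can drop the branch and the variable counts as used; the probe name below is
invented.

Size    size_before_shrink = 0;

if (TRACE_POSTGRESQL_HASHJOIN_SHRINK_ENABLED())
    size_before_shrink = hashtable->spaceUsed;

/* ... shrink the hash table ... */

if (TRACE_POSTGRESQL_HASHJOIN_SHRINK_ENABLED())
    TRACE_POSTGRESQL_HASHJOIN_SHRINK(size_before_shrink,
                                     hashtable->spaceUsed);
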
>
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
> > In function ‘ExecHashJoinCloseBatch’:
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1548:28:
> > warning: variable ‘participant’ set but not used [-Wunused-but-set-variable]
> > HashJoinParticipantState *participant;
> > ^
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:
> > In function ‘ExecHashJoinRewindBatches’:
> > /home/pg/pgbuild/builds/root/../../postgresql/src/backend/executor/nodeHashjoin.c:1587:23:
> > warning: variable ‘batch_reader’ set but not used [-Wunused-but-set-variable]
> > HashJoinBatchReader *batch_reader;
> > ^
> >
> > Is this change really needed?:
> >
> >> --- a/src/backend/executor/nodeSeqscan.c
> >> +++ b/src/backend/executor/nodeSeqscan.c
> >> @@ -31,6 +31,8 @@
> >> #include "executor/nodeSeqscan.h"
> >> #include "utils/rel.h"
> >>
> >> +#include <unistd.h>
> >> +
> >> static void InitScanRelation(SeqScanState *node, EState *estate, int eflags);
> >> static TupleTableSlot *SeqNext(SeqScanState *node);
>
> Right, will clean up.
>
> > That's all I have for now...
>
> Thanks! I'm away from my computer for a couple of days but will have
> a new patch series early next week, and hope to have a better handle
> on what's involved in adopting the 'unification' approach here
> instead.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>

Hi Thomas,
I was trying to analyse the performance of TPC-H queries with your patch
and came across the following results: Q9 and Q21 were crashing, and both
of them had the following backtrace in the core dump (I thought it might
be helpful):

#0  0x0000000010757da4 in pfree (pointer=0x3fff78d11000) at mcxt.c:1012
#1  0x000000001032c574 in ExecHashIncreaseNumBatches (hashtable=0x1003af6da60) at nodeHash.c:1124
#2  0x000000001032d518 in ExecHashTableInsert (hashtable=0x1003af6da60, slot=0x1003af695c0, hashvalue=2904801109, preload=1 '\001') at nodeHash.c:1700
#3  0x0000000010330fd4 in ExecHashJoinPreloadNextBatch (hjstate=0x1003af39118) at nodeHashjoin.c:886
#4  0x00000000103301fc in ExecHashJoin (node=0x1003af39118) at nodeHashjoin.c:376
#5  0x0000000010308644 in ExecProcNode (node=0x1003af39118) at execProcnode.c:490
#6  0x000000001031f530 in fetch_input_tuple (aggstate=0x1003af38910) at nodeAgg.c:587
#7  0x0000000010322b50 in agg_fill_hash_table (aggstate=0x1003af38910) at nodeAgg.c:2304
#8  0x000000001032239c in ExecAgg (node=0x1003af38910) at nodeAgg.c:1942
#9  0x0000000010308694 in ExecProcNode (node=0x1003af38910) at execProcnode.c:509
#10 0x0000000010302a1c in ExecutePlan (estate=0x1003af37fa0, planstate=0x1003af38910, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x1003af19390) at execMain.c:1587

In case you want to know, I was using TPC-H at scale factor 20. Please
let me know if you want any more information on this.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
