Re: WIP: [[Parallel] Shared] Hash

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-03-06 13:57:30
Message-ID: CAEepm=3g=dMG+84083fkFzLvgMJ7HdhbGB=AeZABNukbZm3hpA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Wed, Mar 1, 2017 at 10:40 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> I'm testing a new version which incorporates feedback from Andres and
> Ashutosh, and is refactored to use a new SharedBufFileSet component to
> handle batch files, replacing the straw-man implementation from the v5
> patch series. I've set this to waiting-on-author and will post v6
> tomorrow.

I created a system for reference counted partitioned temporary files
called SharedBufFileSet: see 0007-hj-shared-buf-file.patch. Then I
ripped out the code for sharing batch files that I previously had
cluttering up nodeHashjoin.c, and refactored it into a new component
called a SharedTuplestore which wraps a SharedBufFileSet and gives it
a tuple-based interface: see 0008-hj-shared-tuplestore.patch. The
name implies aspirations of becoming a more generally useful shared
analogue of tuplestore, but for now it supports only the exact access
pattern needed for hash join batches ($10 wrench).

It creates temporary files like this:

base/pgsql_tmp/pgsql_tmp[pid].[set].[partition].[participant].[segment]

I'm not sure why nodeHashjoin.c is doing raw batchfile read/write
operations anyway; why not use tuplestore.c for that (as
tuplestore.c's comments incorrectly say is the case)? Maybe because
Tuplestore's interface doesn't support storing the extra hash value.
In SharedTuplestore I solved that problem by introducing an optional
fixed sized piece of per-tuple meta-data. Another thing that is
different about SharedTuplestore is that it supports partitions, which
is convenient for this project and probably other parallel projects
too.

In order for workers to be able to participate in reference counting
schemes based on DSM segment lifetime, I had to give the
Exec*InitializeWorker() functions access to the dsm_segment object,
whereas previously they received only the shm_toc in order to access
its contents. I invented ParallelWorkerContext which has just two
members 'seg' and 'toc': see
0005-hj-let-node-have-seg-in-worker.patch. I didn't touch the FDW API
or custom scan API where they currently take toc, though I can see
that there is an argument that they should; changing those APIs seems
like a bigger deal. Another approach would be to use ParallelContext,
as passed into ExecXXXInitializeDSM, with the members that are not
applicable to workers zeroed out. Thoughts?

I got rid of the ExecDetachXXX stuff I had invented in the last
version, because acf555bc fixed the problem a better way.

I found that I needed to put use more than one toc entry for a single
executor node, in order to reserve space for the inner and outer
SharedTuplestore objects. So I invented a way to make more extra keys
with PARALLEL_KEY_EXECUTOR_NTH(plan_node_id, N).

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
parallel-shared-hash-v6.tgz application/x-gzip 52.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-03-06 14:01:01 Re: Change in "policy" on dump ordering?
Previous Message Stephen Frost 2017-03-06 13:52:42 Re: PATCH: Configurable file mode mask