Re: [HACKERS] Parallel Hash take II

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Re: [HACKERS] Parallel Hash take II
Date: 2017-11-22 11:36:56
Message-ID: CAEepm=1k+4828Nk++BdfwBgZ_fQEuNuQOhURKWZR=z_7b=gwDQ@mail.gmail.com
Lists: pgsql-hackers

Hi Andres and Peter,

Here's a new patch set with responses to the last batch of review comments.

On Wed, Nov 15, 2017 at 10:24 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hm. The way you access this doesn't quite seem right:
> ...
> + matches := regexp_matches(line, ' Batches: ([0-9]+)');
> ...
>
> Why not use format json and access the output that way? Then you can be
> sure you access the right part of the tree and such?

Okay, done.

>> 1. It determines the amount of unfairness when we run out of data:
>> it's the maximum amount of extra data that the unlucky last worker can
>> finish up with after all the others have finished. I think this
>> effect is reduced by higher level factors: when a reader runs out of
>> data in one backend's file, it'll start reading another backend's
>> file. If it's hit the end of all backends' files and this is an outer
>> batch, Parallel Hash will just go and work on another batch
>> immediately.
>
> Consider e.g. what happens if there's the occasional 500MB datum, and
> the rest's very small...
>
>> Better ideas?
>
> Not really. I'm more than a bit suspicious of this solution, but I don't
> really have a great suggestion otherwise. One way to combat extreme
> size skew would be to put very large datums into different files.

Right.  I considered opening a separate file for each chunk size (4
pages, 8 pages, 16 pages, ...).  Then each file has a uniform chunk
size, so you're not stuck with a large chunk size after one monster
tuple comes down the pipe.  I didn't propose that because it leads to
even more file descriptors being used, and I'd like to move towards
fewer file descriptors, since the number of descriptors caps the
number of partitions you can reasonably use.  Perhaps in future we
could develop some kind of general-purpose file space manager that
would let us allocate extents within a file, and then SharedTuplestore
could allocate extents for each chunk size.  Not today though.

> But I think we probably can go with your approach for now, ignoring my
> failure prone spidey senses ;)

Cool.

>> > This looks more like the job of an lwlock rather than a spinlock.
>>
>> ... assembler ...
>>
>> That should be OK, right?
>
> It's not too bad. Personally I'm of the opinion though that pretty much
> no new spinlocks should be added - their worst case performance
> characteristics are bad enough for that to be only worth the
> experimentation in cases where each cycle really matters and where
> contention is unlikely.

Changed to LWLock.

>> > One day we're going to need a better approach to this. I have no idea
>> > how, but this per-node, and now per_node * max_parallelism, approach has
>> > only implementation simplicity as its benefit.
>>
>> I agree, and I am interested in that subject. In the meantime, I
>> think it'd be pretty unfair if parallel-oblivious hash join and
>> sort-merge join and every other parallel plan get to use work_mem * p
>> (and in some cases waste it with duplicate data), but Parallel Hash
>> isn't allowed to do the same (and put it to good use).
>
> I'm not sure I care about fairness between pieces of code ;)

I'll leave that discussion to run in a separate subthread...

>> BTW this is not per-tuple code -- it runs once at the end of hashing.
>> Not sure what you're looking for here.
>
> It was more a general statement about all the branches in nodeHashjoin,
> than about these specific branches. Should've made that clearer. There's
> definitely branches in very common parts:
>
> ...
>
> I don't think you should do so now, but I think a reasonable approach
> here would be to move the HJ_BUILD_HASHTABLE code into a separate
> function (it really can't be hot). Then have specialized ExecHashJoin()
> versions for parallel/non-parallel and potentially for outer/inner/anti.

Okay, here's my proposal for how to get the new branches out of the
per-tuple path.  I have split the code into separate ExecHashJoin()
and ExecParallelHashJoin() functions.  Both call an inline function
ExecHashJoinImpl() with a constant parameter, so that I effectively
instantiate two variants with the unwanted branches removed by
constant folding.  Then the appropriate variant is installed as the
ExecProcNode function pointer.

Just "inline" wasn't enough though. I defined
pg_attribute_always_inline to force that on GCC/Clang et al and MSVC.
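
To make that concrete, here is a minimal standalone sketch of the
technique, not the patch's code: the fallback definition of
pg_attribute_always_inline and the toy join bodies below are just
placeholders.

#include <stdbool.h>
#include <stdio.h>

#if defined(__GNUC__) || defined(__clang__)
#define pg_attribute_always_inline __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define pg_attribute_always_inline __forceinline
#else
#define pg_attribute_always_inline inline
#endif

/*
 * One body with a constant "parallel" parameter; each caller lets the
 * compiler fold away the branches it doesn't need.
 */
static pg_attribute_always_inline int
ExecHashJoinImpl(int tuples, bool parallel)
{
	int		result = 0;

	for (int i = 0; i < tuples; i++)
	{
		if (parallel)
			result += 2;	/* stand-in for the parallel-aware path */
		else
			result += 1;	/* stand-in for the parallel-oblivious path */
	}
	return result;
}

/* Thin wrappers; one of these gets installed as the node's function pointer. */
static int
ExecHashJoin(int tuples)
{
	return ExecHashJoinImpl(tuples, false);
}

static int
ExecParallelHashJoin(int tuples)
{
	return ExecHashJoinImpl(tuples, true);
}

int
main(void)
{
	int		(*node_fn) (int) = ExecParallelHashJoin;	/* chosen at startup */

	printf("%d %d\n", ExecHashJoin(1000), node_fn(1000));
	return 0;
}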

I also created a separate function ExecParallelScanHashBucket().  I
guess I could have extended the above trick into ExecScanHashBucket()
and beyond, but it's called from a different translation unit and I
don't want to get too carried away with this technique, so I chose to
supply separate functions.

So that introduces a couple of new techniques into the tree: treating
ExecProcNode as a configuration point for a single node type, and the
function-instantiation trick.  Thoughts?

>> > If we don't split this into two versions, we at least should store
>> > hashNode->parallel_state in a local var, so the compiler doesn't have to
>> > pull that out of memory after every external function call (of which
>> > there are a lot). In common cases it'll end up in callee-saved
>> > registers, and most of the called functions won't be too register
>> > starved (on x86-64).
>>
>> Hmm. Well I did that already in v24 -- in many places there is now a
>> local variable called pstate.
>
> See above piece of code, and a few others, in nodeHash.

I tackled some more of these.

>> > I think it'd be better if we structured the file so we just set GUCs
>> > with SET LOCAL inside a transaction.
>>
>> I wrapped the whole region of join.sql concerned with hash joins in a
>> transaction that rolls back, so I don't have to write LOCAL. That's
>> just as good, right?
>
> Not really imo. Being able to read a test without going through all
> previous ones is a lot better.

Added savepoint/rollback-to-savepoint around each individual test.
You still need the setup at the top of the section (create tables
etc., set a couple of GUCs).

On Sat, Nov 18, 2017 at 10:55 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> (I've commit some of the preliminary work)

Thanks!

> Looking at 0005-Add-infrastructure-for-sharing-temporary-files-betwe.patch:

[For things quoted and commented on by Peter, see further down]

> - There seems to be a moment where we could leak temporary file
> directories:
> [...]
> PathNameCreateTemporaryDir(tempdirpath, filesetpath);
> file = PathNameCreateTemporaryFile(path, true);
> [...]
>
> The resowner handling is done in PathNameCreateTemporaryFile(). But if
> we fail after creating the directory we'll not have a resowner for
> that. That's probably not too bad.

It doesn't matter because the resowner handling in
PathNameCreateTemporaryFile() is only concerned with closing file
handles, not deleting stuff. In this world cleanup is done by
SharedFileSet via DSM detach hooks, and that directory is already
covered.

> - related to the last point, I'm kinda wondering why we need sub-fileset
> resowners? Given we're dealing with error paths in resowners I'm not
> quite seeing the point - we're not going to want to roll back
> sub-parts of a fileset, no?

It's just to make sure the individual file handles don't leak. Even
though this system takes care of deleting all the files and
directories, it's still a programming error to forget to close the
BufFile objects, and we want the resowner machinery to complain about
leaked File objects if you forget (unless we're in an error path, in
which case it just silently closes them).  That's not a change.

> - If we want to keep these resowners, shouldn't we unregister them in
> PathNameDeleteTemporaryFile?

Unlinking the file is entirely separate from having it open. See above.

> - PathNameCreateTemporaryFile() and OpenTemporaryFile() now overlap
> quite a bit. Can't we rejigger things to base the second on the first?
> At the very least the comments need to work out the difference more
> closely.

Ok, I have refactored this. There is now a static function
RegisterTemporaryFile() which contains the common part called by
OpenTemporaryFile(), PathNameCreateTemporaryFile() and
PathNameOpenTemporaryFile(). That ensures the file handle is closed
automatically via *two* mechanisms: resowner and AtEOXact_Files.

Those three functions put different flags into fdstate though:

OpenTemporaryFile(): FD_DELETE_AT_CLOSE | FD_TEMP_FILE_LIMIT
PathNameCreateTemporaryFile(): FD_TEMP_FILE_LIMIT
PathNameOpenTemporaryFile(): 0

Explanation: The file handles get closed automatically in all three
cases (unless disabled with interXact).  Only the anonymous private
temporary files get deleted at close.  Named (shared) temporary files
are the caller's problem, and in this case SharedFileSet will unlink
them when it deletes the tree they live in.  Only files you created
count against your temporary file limit; if you open a file someone
else created, we don't double-count it.
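
As a rough illustration of those semantics, here is a standalone
sketch (not fd.c itself; only the two flag names come from the patch,
the rest is invented for illustration):

#include <stdio.h>

#define FD_DELETE_AT_CLOSE	(1 << 0)	/* unlink the file when the handle is closed */
#define FD_TEMP_FILE_LIMIT	(1 << 1)	/* count against the creator's temp_file_limit */

static void
sketch_close(const char *opened_via, int fdstate)
{
	if (fdstate & FD_TEMP_FILE_LIMIT)
		printf("%s: stop counting against the creator's temp_file_limit\n",
			   opened_via);
	if (fdstate & FD_DELETE_AT_CLOSE)
		printf("%s: unlink now (anonymous private temp file)\n", opened_via);
	else
		printf("%s: leave on disk (SharedFileSet deletes it later)\n", opened_via);
}

int
main(void)
{
	sketch_close("OpenTemporaryFile", FD_DELETE_AT_CLOSE | FD_TEMP_FILE_LIMIT);
	sketch_close("PathNameCreateTemporaryFile", FD_TEMP_FILE_LIMIT);
	sketch_close("PathNameOpenTemporaryFile", 0);
	return 0;
}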

On Sat, Nov 18, 2017 at 11:22 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Fri, Nov 17, 2017 at 1:55 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> Looking at 0005-Add-infrastructure-for-sharing-temporary-files-betwe.patch:
>> - The created path/filenames seem really redundant:
>> base/pgsql_tmp/pgsql_tmp11160.9.sharedfileset.d/pgsql_tmp.o3of8.p0.0
>>
>> Including pgsql_tmp no less than three times seems a bit absurd.
>>
>> I'm quite inclined to just remove all but the first.
>
> +1

Yeah. That was because I wanted to apply the same function
recursively, and it only deletes things beginning with the prefix. In
this version I pass down a flag to say that it should delete
everything after the first level. That means that I removed the last
one from your example but kept the first two. I had only added the
third. The first two are prior art.

Now it looks like this:

base/pgsql_tmp/pgsql_tmp84528.0.sharedfileset/o999of1024.p1.0

If you create stuff in there that doesn't start with pgsql_tmp, then
at startup it refuses to delete it.  Someone might value that ancient
feature, so if we want to change it, it should probably be discussed
somewhere more visible than this thread and handled in a separate
commit, no?  Example:

2017-11-22 19:23:31.055 NZDT [84606] LOG: unexpected file found in
temporary-files directory: "base/pgsql_tmp/random_name"

>> - It's not clear to me why it's correct to have the vfdP->fdstate & FD_TEMPORARY
>> handling in FileClose() be independent of the file being deleted. At
>> the very least there needs to be a comment explaining why we chose
>> that behaviour.
>
> Isn't that just because only one backend is supposed to delete the
> file, but they all must close it and do temp_file_limit accounting?
> Sorry if I missed something (my explanation seems too obvious to be
> correct).

Stepping back: the goal of this exercise is to split up or replace
different behaviours associated with temporary files so they could be
enabled individually. Deleting files: now done at DSM segment
destruction. Closing file handles: done in every backend. Tracking
temp file space usage: only done in backends that created a file
(private or shared), but not in backends that merely opened a file
created by someone else. Does this make sense?

>> - I think we need to document somewhere that the temp_file_limit in a
>> shared file set applies independently for each participant that's
>> writing something. We also should discuss whether that's actually
>> sane behaviour.
>
> This is already the documented behavior of temp_file_limit, fwiw.

We could definitely come up with something better.  My proposal is
that each backend gets to create temp_file_limit worth of temporary
stuff and hold a file descriptor open.  When the file descriptor is
closed, it stops counting against the creator's file limit, but it
still exists.  It exists until someone deletes it, which happens on
rescan or at end of join in my current patch.  It needs to keep
existing so that other people can open it.  A better system would be
one that applies temp_file_limit to your *session*, that is, your
leader process + any workers it might create.  That would require a
lot more book-keeping and communication than I have now: writing to a
file would have to consume from a shared-memory counter or something.
Do you think we need to do that?

> Another question for Thomas: Is it okay that routines like
> BufFileOpenShared() introduce new palloc()s (not repalloc()s) to
> buffile.c, given that struct BufFile already contains this?:
>
> /*
> * resowner is the ResourceOwner to use for underlying temp files. (We
> * don't need to remember the memory context we're using explicitly,
> * because after creation we only repalloc our arrays larger.)
> */
> ResourceOwner resowner;
>
> Maybe we need to remember the original caller's memory context, too?
> Either that, or the contract/comments for memory contexts need to be
> revised.

BufFileCreateShared (new), BufFileOpenShared (new) and
BufFileCreateTemp (old) are all constructors of BufFile objects.  They
all use palloc and thus capture the caller's CurrentMemoryContext
implicitly, and I continue that tradition in the new functions.  The
new kinds of BufFile never allocate any more memory after
construction, but if they ever did they'd need either to use repalloc
or to start capturing the memory context explicitly during
construction.
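
If one of them ever did need to allocate later, from some arbitrary
caller's context, explicit capture would look roughly like this (a
hypothetical sketch, not part of the patch; SketchFile and its
functions are invented names):

#include "postgres.h"

#include "storage/fd.h"

typedef struct SketchFile
{
	MemoryContext context;		/* captured at construction time */
	File	   *files;			/* allocated lazily, later */
} SketchFile;

static SketchFile *
sketch_create(void)
{
	/* palloc0 allocates in the caller's CurrentMemoryContext... */
	SketchFile *sf = (SketchFile *) palloc0(sizeof(SketchFile));

	/* ...and we also remember that context explicitly for later. */
	sf->context = CurrentMemoryContext;
	return sf;
}

static void
sketch_lazy_init(SketchFile *sf, int maxfiles)
{
	/*
	 * A later allocation, possibly made while a different memory context
	 * is current: use the captured context, not CurrentMemoryContext.
	 */
	if (sf->files == NULL)
		sf->files = (File *) MemoryContextAlloc(sf->context,
												sizeof(File) * maxfiles);
}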

On Sat, Nov 18, 2017 at 12:20 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> 0002-Add-a-barrier-primitive-for-synchronizing-backends.patch
> - Intro starts with:
> + *
> + * This implementation of barriers allows for static sets of participants
> + * known up front, or dynamic sets of participants which processes can join or
> that's a bit low-level, no? Should explain what they're for, not
> everyone's going to be familiar with the concept.

I have added a one-sentence description from the Wikipedia article
(with attribution).  We reference Wikipedia in a couple of other
places.

> - The comment for BarrierInit() reads a bit weird:
> + * Initialize this barrier, setting a static number of participants that we
> + * will wait for at each computation phase. To use a dynamic number of
> + * participants, this number should be zero, and BarrierAttach and
> Set a static number! Except maybe not?

Fixed.

> - I'm not entirely convinced that we want the barrier debug stuff
> merged, what are your feelings about that? It's like half the code,
> and adds some complexity to the non debug code... If we indeed want to
> keep it, it probably should be documented in config.sgml? And get an
> entry in pg_config_manual.h?

I found it useful during development, but I've ripped it out of this version.

> - Can we add assertions ensuring nobody attaches to a static barrier?

Done.

> - If I understand correctly currently the first participant to arrive at
> a barrier is going to be selected, and the last wakes everyone
> up. Wouldn't it be better to have the last arrived participant be
> selected? It already got a scheduler timeslice, its memory is most
> likely to be in cache etc? Given that in cases where selection plays a
> role it's going to be blocking everyone else, using the process most
> likely to finish first seems like a good idea. I guess the
> BarrierDetach() implementation would be slightly more complex?

Good idea.  Now done that way.  BarrierDetach() doesn't get more
complicated, but BarrierArriveAndWait() needs to be prepared to be
elected when it's woken up, just in case the barrier only advanced
because someone detached.  I changed to a model where, instead of
setting and clearing a flag, I record the phase number when someone is
elected; if that doesn't match the current phase, no one has been
elected yet.  Also, I standardised on 'elect' as the word for this.
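
In case it helps to see the shape of that, here is a minimal
standalone sketch of the phase-recording idea (not the patch's
barrier.c; the SketchBarrier type is invented and locking is elided
for illustration):

#include <stdbool.h>
#include <stdio.h>

typedef struct SketchBarrier
{
	int		phase;			/* current phase number */
	int		elected_phase;	/* phase in which someone was last elected */
} SketchBarrier;

/* Caller is assumed to hold the barrier's lock. */
static bool
sketch_try_elect(SketchBarrier *b)
{
	if (b->elected_phase != b->phase)
	{
		/* Nobody has been elected in this phase yet: claim it. */
		b->elected_phase = b->phase;
		return true;
	}
	return false;				/* someone else was elected first */
}

int
main(void)
{
	SketchBarrier b = {.phase = 1, .elected_phase = 0};

	printf("%d\n", sketch_try_elect(&b));	/* 1: elected for phase 1 */
	printf("%d\n", sketch_try_elect(&b));	/* 0: phase 1 already has a winner */
	b.phase++;								/* barrier advances */
	printf("%d\n", sketch_try_elect(&b));	/* 1: new phase, new election */
	return 0;
}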

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
parallel-hash-v26.patchset.tgz application/x-gzip 62.7 KB
