Re: Avoiding hash join batch explosions with extreme skew and weird stats

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, sbjesse(at)gmail(dot)com
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date: 2020-02-12 00:57:05
Message-ID: CAAKRu_YgscKZuuQoGcG8e=G-+OGYABm=tEkMyQ6GNqC2fmHVRg@mail.gmail.com
Lists: pgsql-hackers

I've implemented avoiding rescanning all inner tuples for each stripe
in the attached patch:
v5-0005-Avoid-rescanning-inner-tuples-per-stripe.patch

Patchset is rebased--and I had my first merge conflicts as I contend
with maintaining this long-running branch with large differences
between it and current hashjoin. I think I'll need to reconsider the
changes I've made if I want to make it maintainable.

As for patch 0005, which avoids rescanning inner tuples for every stripe:
instead of reinitializing the SharedTuplestore for the inner side for
each stripe during fallback, each participant's read_page is set to the
beginning of the SharedTuplestoreChunk which contains the end of one
stripe and the beginning of the next. (I'm using "stripe" from now on,
but I haven't done any retroactive renaming yet.)

Previously all inner tuples were scanned and only tuples from the
current stripe were loaded.

Each SharedTuplestoreAccessor now has a variable "start_page", which
is initialized when it is assigned its read_page (which will always be
the beginning of a SharedTuplestoreChunk).

While loading tuples into the hashtable, if a tuple is from a past
stripe, the worker skips it (that happens when a stripe straddles two
SharedTuplestoreChunks). If a tuple is from a future stripe, the worker
backs that SharedTuplestoreChunk out and sets the shared read_page (in
the shared SharedTuplestoreParticipant) back to its start_page.
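To make the skip/back-out decision concrete, here is a standalone,
heavily simplified C model of that loading loop. This is not the patch
code -- all of the type and function names here (ModelTuple,
classify_tuple, etc.) are invented for illustration, and the real
read_page update would of course need proper locking:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model -- not the actual patch code. */
typedef struct ModelTuple
{
    int stripe;                 /* which stripe this tuple belongs to */
} ModelTuple;

typedef struct ModelParticipant
{
    int read_page;              /* shared: next chunk to hand out */
} ModelParticipant;

typedef struct ModelAccessor
{
    int start_page;             /* first page of the chunk this worker reads */
    ModelParticipant *participant;
} ModelAccessor;

typedef enum { LOAD, SKIP, BACK_OUT } LoadAction;

/*
 * Decide what to do with one tuple while loading 'cur_stripe': tuples
 * from a past stripe are skipped (a chunk can begin with the tail of
 * the previous stripe); a tuple from a future stripe causes the chunk
 * to be backed out by resetting the shared read_page to this worker's
 * start_page so the chunk is rescanned for the next stripe.
 */
static LoadAction
classify_tuple(ModelAccessor *acc, const ModelTuple *tup, int cur_stripe)
{
    if (tup->stripe < cur_stripe)
        return SKIP;
    if (tup->stripe > cur_stripe)
    {
        acc->participant->read_page = acc->start_page;
        return BACK_OUT;
    }
    return LOAD;
}
```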

There are a couple of mechanisms providing synchronization that address
specific race conditions/synchronization points -- those scenarios are
laid out in the commit message.

The first is a rule that a worker can only set read_page to a
start_page which is less than the current value of read_page.

The second is a "rewound" flag in the SharedTuplestoreParticipant. It
indicates if this participant has been rewound during loading of the
current stripe. If it has, a worker cannot be assigned a
SharedTuplestoreChunk. This flag is reset between stripes.
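Sketched as a toy C model (again, invented names, and the real code
would do this under the participant's lock rather than with plain
assignments), the two rules look roughly like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two synchronization rules -- not patch code. */
typedef struct ModelParticipant
{
    int  read_page;             /* shared scan position */
    bool rewound;               /* set when a worker backs a chunk out */
} ModelParticipant;

/* Rule 1: a rewind may only move read_page backwards. */
static bool
try_rewind(ModelParticipant *p, int start_page)
{
    if (start_page >= p->read_page)
        return false;           /* shared cursor is already at or before us */
    p->read_page = start_page;
    p->rewound = true;          /* Rule 2: block new chunk assignments */
    return true;
}

/* Rule 2: no chunk may be handed out while the participant is rewound. */
static bool
can_assign_chunk(const ModelParticipant *p)
{
    return !p->rewound;
}

/* The flag is reset between stripes. */
static void
next_stripe(ModelParticipant *p)
{
    p->rewound = false;
}
```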

In this patch, Hashjoin makes an unacceptable intrusion into the
SharedTuplestore API. I am looking for feedback on how to solve this.

Basically, because the SharedTuplestore does not know about stripes or
about HashJoin, the logic to decide if a tuple should be loaded into a
hashtable or not is in the stripe phase machine where tuples are loaded
into the hashtable.

So, to ensure that workers have read from all participant files before
assuming all tuples from a stripe are loaded, I have duplicated the
logic from sts_parallel_scan_next() which has workers get the next
participant file, and added it into the body of the tuple loading
loop in the stripe phase machine (see sts_ready_for_next_stripe() and
sts_seen_all_participants()).
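Conceptually, the duplicated completion check amounts to bookkeeping
like the following toy model (invented names; the real check lives in
the SharedTuplestore/stripe phase machine code and I'm only sketching
the shape of it):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the "seen all participants" check -- not patch code. */
#define NPARTICIPANTS 4

typedef struct ModelScan
{
    /* participant file exhausted for the current stripe? */
    bool done[NPARTICIPANTS];
} ModelScan;

static void
mark_participant_done(ModelScan *scan, int participant)
{
    scan->done[participant] = true;
}

/*
 * Only once every participant's file has been read through the end of
 * the current stripe can a worker assume the stripe is fully loaded;
 * before that, more of the stripe's tuples may still turn up.
 */
static bool
seen_all_participants(const ModelScan *scan)
{
    for (int i = 0; i < NPARTICIPANTS; i++)
    {
        if (!scan->done[i])
            return false;
    }
    return true;
}
```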

This clearly needs to be fixed and it is arguable that there are other
intrusions into the SharedTuplestore API in these patches.

One option is to write each stripe for each participant to a different
file, preserving the idea that a worker is done with a read_file when it
is at EOF.

Outside of addressing the relationship between SharedTuplestore,
stripes, and Hashjoin, I have re-prioritized the next steps for the
patch as follows:

Next Steps:
1) rename "chunk" to "stripe"
2) refine fallback logic
3) refactor code to make it easier to keep it rebased
4) EXPLAIN ANALYZE instrumentation to show stripes probed by workers
5) anti/semi-join support

1)
The chunk/stripe thing is becoming extremely confusing.

2)
I re-prioritized refining the fallback logic because the premature
disabling of growth in serial hashjoin makes the join_hash test so
slow that it is hurting my iteration speed.

3)
I am wondering if Thomas Munro's idea to template-ize Hashjoin [1]
would make maintaining the diff easier, harder, or no different. The
code I've added made the main hashjoin state machine incredibly long,
so I broke it up into Parallel Hashjoin and Serial Hashjoin to make it
more manageable. This, of course, lends itself to difficult rebasing
(luckily only one small commit has been made to nodeHashjoin.c). If
the template-ization were to happen sooner, I could refactor my code
to use at least the same function names, which would make the diffs
clearer.

4)
It is important that I have some way of knowing whether I'm even
exercising the code I'm adding that involves multiple workers probing
the same stripes. Even though the behavior won't necessarily be
deterministic, I can adjust the tests as I change the code if I'm no
longer able to trigger any of the concurrent behavior I'm looking for.

5)
Seems like it's time

[1]
https://www.postgresql.org/message-id/CA%2BhUKGJjs6H77u%2BPL3ovMSowFZ8nib9Z%2BnHGNF6YNmw6osUU%2BA%40mail.gmail.com

--
Melanie Plageman

Attachment Content-Type Size
v5-0002-Fixup-tupleMetadata-struct-issues.patch application/octet-stream 13.5 KB
v5-0003-Address-barrier-wait-deadlock-hazard.patch application/octet-stream 29.8 KB
v5-0004-Add-SharedBits-API.patch application/octet-stream 32.7 KB
v5-0005-Avoid-rescanning-inner-tuples-per-stripe.patch application/octet-stream 14.4 KB
v5-0001-Implement-Adaptive-Hashjoin.patch application/octet-stream 147.2 KB
