Re: WIP: [[Parallel] Shared] Hash

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-03-21 12:07:00
Message-ID: CAEepm=2+zf7L_-eZ5hPW5=US+utdo=9tMVD4wt7ZSM-uOoSxWg@mail.gmail.com
Lists: pgsql-hackers

Hi,

Here is a new version of the patch series addressing complaints from
Rafia, Peter, Andres and Robert -- see below.

First, two changes not already covered in this thread:

1. Today Robert asked me a question off-list that I hadn't previously
considered: since I am sharing tuples between backends, don't I have
the same kind of transient record remapping problems that tqueue.c has
to deal with? The answer must be yes, and in fact it's a trickier
version because there are N 'senders' and N 'receivers' communicating
via the shared hash table. So I decided to avoid the problem by not
planning shared hash tables if the tuples could include transient
RECORD types: see tlist_references_transient_type() in
0007-hj-shared-single-batch-v8.patch. Perhaps in the future we can
find a way for parallel query to keep local types in sync, so this
restriction could be lifted. (I've tested this with a specially
modified build, because I couldn't figure out how to actually get any
transient types to be considered in a parallel query, but if someone
has a suggestion for a good query for that I'd love to put one into
the regression test.)
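
To make the kind of check concrete, here is a simplified sketch
(hypothetical name and deliberately reduced logic, not the function from
the patch): an expression of type RECORD with a registered typmod refers
to a tuple descriptor "blessed" in one backend, which other backends
cannot interpret.

#include "postgres.h"

#include "catalog/pg_type.h"
#include "nodes/nodeFuncs.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

/*
 * Simplified illustration only: report whether any target list entry
 * produces a transient (blessed) RECORD type, i.e. an anonymous row type
 * whose typmod is only meaningful in the backend that registered it.
 * This ignores record types buried inside containers, which a real check
 * would also have to consider.
 */
static bool
tlist_has_transient_record(List *tlist)
{
    ListCell   *lc;

    foreach(lc, tlist)
    {
        TargetEntry *tle = (TargetEntry *) lfirst(lc);
        Node       *expr = (Node *) tle->expr;

        if (exprType(expr) == RECORDOID && exprTypmod(expr) >= 0)
            return true;
    }

    return false;
}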

2. Earlier versions included support for Shared Hash (= one worker
builds, other workers wait, but we get to use work_mem * P memory) and
Parallel Shared Hash (= all workers build). Shared Hash is by now
quite hard to reach, since so many hash join inner plans are now
parallelisable. I decided to remove support for it from the latest
patch series: I think it adds cognitive load and patch lines for
little or no gain. With time running out, I thought it would be better
to rip it out for now, to simplify things and to avoid some difficult
questions about how to cost that mode. It could be added back in a
separate patch after further study if it turns out to be worthwhile.

>> On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih
>> <rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>>> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
>>> few regressions,
>>> Q21: 52 secs on HEAD and 400 secs with this patch

As already mentioned, there is a planner bug which we can fix
separately from this patch series. Until that is resolved, please see
that other thread[1] for the extra tweak required to get sane results
when testing Q21.

Even with that tweak, there was a slight regression with fewer than 3
workers at 1GB for Q21. That turned out to be because the patched
version was not always using as many workers as unpatched. To fix
that, I had to rethink the deadlock avoidance system to make it a bit
less conservative about giving up workers: see
src/backend/utils/misc/leader_gate.c in
0007-hj-shared-single-batch-v8.patch.

Here are some speed-up numbers comparing master to patched that I
recorded on TPC-H scale 10 with work_mem = 1GB. These are the queries
whose plans change with the patch. Both master and v8 were patched
with fix-neqseljoin-for-semi-joins.patch.

 query | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------
 Q3    | 0.94x | 1.06x | 1.25x | 1.46x | 1.64x | 1.87x | 1.99x | 1.67x | 1.67x
 Q5    | 1.17x | 1.03x | 1.23x | 1.27x | 1.44x | 0.56x | 0.95x | 0.94x | 1.16x
 Q7    | 1.13x | 1.04x | 1.31x | 1.06x | 1.15x | 1.28x | 1.31x | 1.35x | 1.13x
 Q8    | 0.99x | 1.13x | 1.23x | 1.22x | 1.36x | 0.42x | 0.82x | 0.78x | 0.81x
 Q9    | 1.16x | 0.95x | 1.92x | 1.68x | 1.90x | 1.89x | 2.02x | 2.05x | 1.81x
 Q10   | 1.01x | 1.03x | 1.08x | 1.10x | 1.16x | 1.17x | 1.09x | 1.01x | 1.07x
 Q12   | 1.03x | 1.19x | 1.42x | 0.75x | 0.74x | 1.00x | 0.99x | 1.00x | 1.01x
 Q13   | 1.10x | 1.66x | 1.99x | 1.00x | 1.12x | 1.00x | 1.12x | 1.01x | 1.13x
 Q14   | 0.97x | 1.13x | 1.22x | 1.45x | 1.43x | 1.55x | 1.55x | 1.50x | 1.45x
 Q16   | 1.02x | 1.13x | 1.07x | 1.09x | 1.10x | 1.10x | 1.13x | 1.10x | 1.11x
 Q18   | 1.05x | 1.43x | 1.33x | 1.21x | 1.07x | 1.57x | 1.76x | 1.09x | 1.09x
 Q21   | 0.99x | 1.01x | 1.07x | 1.18x | 1.28x | 1.37x | 1.63x | 1.26x | 1.60x

These tests are a bit short and noisy, and there are clearly some
strange dips in there that need investigation, but the trend is
positive.

Here are some numbers from simple test joins, so that you can see the
raw speedup of large hash joins without all the other things going on
in those TPC-H plans. I executed 1-join, 2-join and 3-join queries
like this:

CREATE TABLE simple AS
SELECT generate_series(1, 10000000) AS id,
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
ANALYZE simple;

SELECT COUNT(*)
FROM simple r
JOIN simple s USING (id);

SELECT COUNT(*)
FROM simple r
JOIN simple s USING (id)
JOIN simple t USING (id);

SELECT COUNT(*)
FROM simple r
JOIN simple s USING (id)
JOIN simple t USING (id)
JOIN simple u USING (id);

Unpatched master can use extra workers to speed up probing but not
building, so in these self-joins the ability to scale with more CPUs
is limited (the w = 0 column shows the base time; the other columns
show the speedup relative to it):

 joins |    w = 0    | w = 1 | w = 2 | w = 3 | w = 4 | w = 5
-------+-------------+-------+-------+-------+-------+-------
     1 | 10746.395ms | 1.46x | 1.66x | 1.63x | 1.49x | 1.36x
     2 | 20057.117ms | 1.41x | 1.58x | 1.59x | 1.43x | 1.28x
     3 | 30108.872ms | 1.29x | 1.39x | 1.36x | 1.35x | 1.03x

With the patch, scalability is better because extra CPUs can be used
for both of those phases (though probably limited here by my 4-core
machine):

 joins |    w = 0    | w = 1 | w = 2 | w = 3 | w = 4 | w = 5
-------+-------------+-------+-------+-------+-------+-------
     1 | 10820.613ms | 1.86x | 2.62x | 2.99x | 3.04x | 2.90x
     2 | 20348.011ms | 1.83x | 2.54x | 2.71x | 3.06x | 3.17x
     3 | 30074.413ms | 1.82x | 2.49x | 2.79x | 3.08x | 3.27x

On Thu, Feb 16, 2017 at 3:36 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think the synchronization protocol with the various phases needs to be
> documented somewhere. Probably in nodeHashjoin.c's header.

I will supply that shortly.

> Don't we also need to somehow account for the more expensive hash-probes
> in the HASHPATH_TABLE_SHARED_* cases? Seems quite possible that we'll
> otherwise tend to use shared tables for small hashed tables that are
> looked up very frequently, even though a private one will likely be
> faster.

In this version I have two GUCs:

cpu_shared_tuple_cost to account for the extra cost of building a
shared hash table.

cpu_synchronization_cost to account for the cost of waiting at a
barrier between building and probing, between probing and the
unmatched scan (for outer joins), and so on for future batches.

I'm not yet sure what their default settings should be, but these
provide the mechanism to discourage the case you're talking about.
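
Roughly speaking, the idea is for these to enter the costing along
these lines (shape only; the variable names below are placeholders for
illustration, not the formulas from the patch):

/*
 * Shape of the idea only: charge for copying each inner tuple into a
 * shared table, plus a flat charge for every barrier the join will
 * wait at.  'table_is_shared' and 'nbarriers' are made-up placeholders.
 */
if (table_is_shared)
    startup_cost += cpu_shared_tuple_cost * inner_path_rows;
startup_cost += cpu_synchronization_cost * nbarriers;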

On Wed, Mar 8, 2017 at 12:58 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> +static void *dense_alloc(HashJoinTable hashtable, Size size,
> + bool respect_work_mem);
>
> I still dislike this, but maybe Robert's point of:
> ...
> Is enough.

In this version I changed the name to load_(private|shared)_tuple and
made it return NULL to indicate that work_mem would be exceeded. The
caller needs to handle that by trying to shrink the hash table. Is
this better?
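
To show the intended calling pattern, approximately (a sketch only;
the signature and surrounding code here are not copied from the patch):

/*
 * Approximate caller-side pattern: a NULL result means "this tuple
 * would push us over work_mem", and the caller reacts by shrinking the
 * hash table -- increasing the number of batches so that some tuples
 * go out to disk -- before trying again.
 */
HashJoinTuple hashTuple;

while ((hashTuple = load_shared_tuple(hashtable, hashTupleSize)) == NULL)
    ExecHashIncreaseNumBatches(hashtable);

/* ... then copy the MinimalTuple into hashTuple and link it into its bucket ... */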

On Fri, Mar 10, 2017 at 3:02 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Thu, Mar 9, 2017 at 4:29 PM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> Yeah, this seems to fall out of the requirement to manage a growable
>> number of partition files in a fixed space. I wonder how this could
>> go wrong. One way would be for a crash-restart to happen (which
>> leaves all temporary files in place by design, though it could clean
>> them up like a normal restart if it wanted to), followed by a very
>> long running cluster eventually generating the same (pid, set number)
>> pair. I think I see a simple way to defend against that, which I'll
>> write about in the PHJ thread.
>
> I am not expressing any real opinion about the idea of relying on or
> suppressing ENOENT-on-unlink() just yet. What's clear is that that's
> unorthodox. I seldom have any practical reason to make a distinction
> between unorthodox and unacceptable. It's usually easier to just not
> do the unorthodox thing. Maybe this is one of the rare exceptions.

In 0008-hj-shared-buf-file-v8.patch, the problem I mentioned above is
addressed; see make_tagged_segment().

>> Thanks. I will respond with code and comments over on the PHJ thread.
>> Aside from the broken extendBufFile behaviour you mentioned, I'll look
>> into the general modularity complaints I'm hearing about fd.c and
>> buffile.c interaction.
>
> buffile.c should stop pretending to care about anything other than
> temp files, IMV. 100% of all clients that want temporary files go
> through buffile.c. 100% of all clients that want non-temp files (files
> which are not marked FD_TEMPORARY) access fd.c directly, rather than
> going through buffile.c.

I still need BufFile because I want buffering.

There are 3 separate characteristics enabled by flags with 'temporary'
in their name. I think we should consider separating the concerns by
splitting and renaming them:

1. Segmented BufFile behaviour. I propose renaming BufFile's isTemp
member to isSegmented, because that is what it really does. I want
that feature independently without getting confused about lifetimes.
Tested with small MAX_PHYSICAL_FILESIZE as you suggested.

2. The temp_file_limit system. Currently this applies to fd.c files
opened with FD_TEMPORARY. You're right that we shouldn't be able to
escape that sanity check on disk space just because we want to manage
disk file ownership differently. I propose that we create a new flag
FD_TEMP_FILE_LIMIT that can be set independently of the flags
controlling disk file lifetime. When working with SharedBufFileSet,
the limit applies to each backend in respect of files it created,
while it has them open. This seems a lot simpler than any
shared-temp-file-limit type scheme and is vaguely similar to the way
work_mem applies in each backend for parallel query.

3. Delete-on-close/delete-at-end-of-xact. I don't want to use that
facility, so I propose disconnecting it from the above. We could
rename those fd.c-internal flags FD_TEMPORARY and FD_XACT_TEMPORARY to
FD_DELETE_AT_CLOSE and FD_DELETE_AT_EOXACT.

As shown in 0008-hj-shared-buf-file-v8.patch. Thoughts?
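
To illustrate the shape of the proposal (bit positions and comments
invented here for illustration; the actual hunk is in the patch):

/* Sketch of the proposed fd.c flag split: */
#define FD_DELETE_AT_CLOSE   (1 << 0)    /* unlink the file when it is closed */
#define FD_DELETE_AT_EOXACT  (1 << 1)    /* ... and clean up at end of xact */
#define FD_TEMP_FILE_LIMIT   (1 << 2)    /* count against temp_file_limit */

/*
 * A traditional private temp file would set all three; a file belonging
 * to a SharedBufFileSet would set only FD_TEMP_FILE_LIMIT, because its
 * lifetime is managed by the set rather than by close or end-of-xact.
 */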

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachment: parallel-shared-hash-v8.tgz (application/x-gzip, 61.9 KB)
