Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)
Date: 2016-01-19 03:57:33
Message-ID: 569DB42D.4030302@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 12/17/2015 10:28 PM, Tomas Vondra wrote:
> Hi,
>
> On 12/17/2015 07:20 PM, Robert Haas wrote:
> ...
>>
>> If this doesn't regress performance in the case where the number of
>> buckets is estimated accurately to begin with, then I think this is
>> a great idea. Can you supply some performance tests results for that
>> case, and maybe some of the other cases also?
>
> I don't see how it could regress performance, and the benchmarks I've
> done confirm that. I'll do more thorough benchmarking and post the
> results here, but not now as this patch is in 2016-01 CF and I want to
> put all my time into reviewing patches from the open commitfest.

I've finally got around to doing more thorough benchmarks, and
surprisingly there seems to be a minor regression. The scripts I've
used are attached, along with the results; in short, the benchmark
joins two tables, with different size combinations:

1M x 10M
10M x 100M
5M x 250M

with different work_mem sizes (so that the smallest value uses a bunch
of batches while the largest value uses a single batch).

The 1x10 and 10x100 data sets are designed to fit into RAM on the
machine (including the temporary files in batching mode), while 5x250
is designed to specifically force the temp files to disk (but it's on a
fast SSD array, so not a big deal).

Average timings for current master and the first two patches of the
series (0001 postpones the build of buckets and 0002 always starts
without batching) look like this (the timings are in milliseconds):

    size   work_mem    master   postpone   no-batch
  -----------------------------------------------------
    1x10          8      5735       5760       6107
                 32      5939       5955       6124
                128      4852       5038       5175
  -----------------------------------------------------
   5x250         64    158512     158429     159008
                256    144046     144584     144994
               1024    111478     117529     116933
  -----------------------------------------------------
  10x100         64     68389      68278      68530
                256     66693      66415      66605
               1024     48606      50654      50490

So compared to master, the differences look like this:

    size   work_mem   postpone   no-batch
  -----------------------------------------
    1x10          8      0.44%      6.50%
                 32      0.28%      3.13%
                128      3.83%      6.65%
  -----------------------------------------
   5x250         64     -0.05%      0.31%
                256      0.37%      0.66%
               1024      5.43%      4.89%
  -----------------------------------------
  10x100         64     -0.16%      0.21%
                256     -0.42%     -0.13%
               1024      4.21%      3.88%

So for the first patch (postponing the building of buckets) with
batching, the difference is rather negligible, possibly within noise
(~0.5%). When the hash join runs with a single batch, however, the
difference gets much more significant: 4-5%.

I'm not going to discuss the second patch for now, but naturally it has
the same issue and apparently also some additional ones (at least on the
1x10 data set).

I have spent a fair amount of time profiling this, but I can't wrap my
head around where the difference comes from. Essentially we don't do
any additional work; we've just moved some of it to a different place
in the code.

It might change the memory access pattern a bit, but the original
pattern is not exactly sequential either, as we're doing random updates
to an array of pointers. We've merely moved those updates into the
BuildBuckets() method, so if something consumes more time, it should
show up in that method, I believe.
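To make the discussion concrete, here is a simplified standalone sketch
of the 0001 idea (this is not the actual PostgreSQL code; all names are
made up): during the load phase, tuples are only appended to dense
storage, and the bucket pointer array is filled in by a single pass at
the end, once the final tuple count (and hence bucket count) is known.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative model only: a dense tuple array stands in for the
 * memory chunks the real patch walks in BuildBuckets(). */
typedef struct Tuple
{
    unsigned int hashvalue;
    int payload;
    struct Tuple *next;     /* bucket chain, set only by the build pass */
} Tuple;

typedef struct HashSketch
{
    Tuple  *tuples;         /* dense array of loaded tuples */
    size_t  ntuples;
    size_t  capacity;
    Tuple **buckets;        /* NULL until the build pass runs */
    size_t  nbuckets;       /* power of two */
} HashSketch;

static void
sketch_init(HashSketch *ht)
{
    ht->capacity = 8;
    ht->tuples = malloc(ht->capacity * sizeof(Tuple));
    ht->ntuples = 0;
    ht->buckets = NULL;
    ht->nbuckets = 0;
}

/* Load phase: no bucket insertion, just an append. Growing the array
 * is safe because no chain pointers into it exist yet. */
static void
sketch_add(HashSketch *ht, unsigned int hashvalue, int payload)
{
    if (ht->ntuples == ht->capacity)
    {
        ht->capacity *= 2;
        ht->tuples = realloc(ht->tuples, ht->capacity * sizeof(Tuple));
    }
    ht->tuples[ht->ntuples].hashvalue = hashvalue;
    ht->tuples[ht->ntuples].payload = payload;
    ht->tuples[ht->ntuples].next = NULL;
    ht->ntuples++;
}

/* The postponed build step: one sequential walk over the dense tuples,
 * with one random write into the bucket array per tuple. */
static void
sketch_build_buckets(HashSketch *ht, size_t nbuckets)
{
    ht->nbuckets = nbuckets;
    ht->buckets = calloc(nbuckets, sizeof(Tuple *));
    for (size_t i = 0; i < ht->ntuples; i++)
    {
        Tuple  *t = &ht->tuples[i];
        size_t  b = t->hashvalue & (nbuckets - 1);

        t->next = ht->buckets[b];
        ht->buckets[b] = t;
    }
}

/* Probe side: count tuples with a matching hash value. */
static int
sketch_count_matches(const HashSketch *ht, unsigned int hashvalue)
{
    int n = 0;
    for (Tuple *t = ht->buckets[hashvalue & (ht->nbuckets - 1)];
         t != NULL; t = t->next)
        if (t->hashvalue == hashvalue)
            n++;
    return n;
}
```

The point is that the random pointer updates don't disappear; they just
all happen in one pass at the end, which is why any slowdown ought to
be attributable to that pass.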

So I've measured how much time the function takes for the 1x10 and
10x100 datasets, and it's ~25ms and ~290ms respectively. That's way less
than the actual difference in query runtime, which is ~190ms and ~2050ms.

So even if we entirely skipped the bucket build, we would not recover
the difference. Not sure what's going on here.

I've also done some profiling using perf, but I haven't really found
anything suspicious there.

Any ideas what might be the cause of this?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
hash-scripts.tgz application/x-compressed-tar 748 bytes
hashes-delayed.ods application/vnd.oasis.opendocument.spreadsheet 39.2 KB
