Re: Performance improvement for joins where outer side is unique

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2015-08-23 19:31:43
Message-ID: 55DA1F9F.5040204@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I did some initial performance evaluation of this patch today, and I see
a clear improvement on larger joins. The scenario I've chosen for the
experiments is a simple fact-dimension join, i.e. a small table
referenced by a large table. So effectively something like this:

CREATE TABLE dim (id INT PRIMARY KEY, ...);
CREATE TABLE fact (dim_d INT REFERENCES dim(id), ...);

except that I haven't used the foreign key constraint. In all the
experiments I've used a fact table 10x the size of the dimension, but I
believe what really matters most is the size of the dimension (and how
the hash table fits into the L2/L3 cache).

The query tested is very simple:

select count(1) from (
select * from f join d on (f.fact_id = d.dim_id)
) foo;

The outer aggregation is intentional - the join produces many rows and
formatting them as string would completely eliminate any gains from the
patch (even with "\o /dev/null" or such).

The following numbers come current master, running on E5-2630 v3
(2.40GHz), 64GB of RAM and this configuration:

checkpoint_timeout = 15min
effective_cache_size = 48GB
maintenance_work_mem = 1GB
max_wal_size = 4GB
min_wal_size = 1GB
random_page_cost = 1.5
shared_buffers = 4GB
work_mem = 1GB

all the other values are set to default.

I did 10 runs for each combination of sizes - the numbers seem quite
consistent and repeatable. I also looked at the median values.

dim 100k rows, fact 1M rows
---------------------------

master patched
------- -------
1 286.184 265.489
2 284.827 264.961
3 281.040 264.768
4 280.926 267.720
5 280.984 261.348
6 280.878 261.463
7 280.875 261.338
8 281.042 261.265
9 281.003 261.236
10 280.939 261.185
------- -------
med 280.994 261.406 (-7%)

dim 1M rows, fact 10M rows
--------------------------

master patched
-------- --------
1 4316.235 3690.373
2 4399.539 3738.097
3 4360.551 3655.602
4 4359.763 3626.142
5 4361.821 3621.941
6 4359.205 3654.835
7 4371.438 3631.212
8 4361.857 3626.237
9 4357.317 3676.651
10 4359.561 3641.830
-------- --------
med 4360.157 3648.333 (-17%)

dim 10M rows, fact 100M rows
----------------------------

master patched
-------- --------
1 46246.467 39561.597
2 45982.937 40083.352
3 45818.118 39674.661
4 45716.281 39616.585
5 45651.117 40463.966
6 45979.036 41395.390
7 46045.400 41358.047
8 45978.698 41253.946
9 45801.343 41156.440
10 45720.722 41374.688
--------- ---------
med 45898.408 40810.203 (-10%)

So the gains seem quite solid - it's not something that would make the
query an order of magnitude faster, but it's well above the noise.

Of course, in practice the queries will be more complicated, making the
improvement less significant, but I don't think that's a reason not to
apply it.

Two minor comments on the patch:

1) the 'subquery' variable in specialjoin_is_unique_join is unused

2) in the explain output, there should probably be a space before the
'(inner unique)' text, so

Hash Join (inner unique) ...

instead of

Hash Join(inner unique)

but that's just nitpicking at this point. Otherwise the patch seems
quite solid to me.

regard

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
unijoins-test.sql application/sql 1.0 KB
unijoins-queries.sql application/sql 782 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message xpNitin 2015-08-23 22:06:18 AIX 7.1 compile and initdb error TRAP: FailedAssertion
Previous Message Tom Lane 2015-08-23 19:18:19 Re: PostgreSQL for VAX on NetBSD/OpenBSD