Performance improvement for joins where outer side is unique

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Performance improvement for joins where outer side is unique
Date: 2014-12-31 13:47:49
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers


I've been hacking a bit at the join code again... This time I've been
putting some effort into optimising the case where the inner side of the
join is known to be unique.
For example, given the tables:

create table t1 (id int primary key);
create table t2 (id int primary key);

And query such as:

select * from t1 left outer join t2 on;

It is possible to deduce at planning time that "for each row in the outer
relation, only 0 or 1 rows can exist in the inner relation", (inner being

In all of our join algorithms in the executor, if the join type is SEMI,
we skip to the next outer row once we find a matching inner row. This is
because we don't want to allow duplicate rows in the inner side to
duplicate outer rows in the result set. Obviously this is required per SQL
spec. I believe we can also skip to the next outer row in this case when
we've managed to prove that no other row can possibly exist that matches
the current outer row, due to a unique index or group by/distinct clause
(for subqueries).

I've attached a patch which aims to prove this idea works. Although so far
I've only gotten around to implementing this for outer joins.

Since code already existed in analyzejoin.c which aimed to determine if a
join to a relation was unique on the join's condition, the patch is pretty
much just some extra plumbing and a small rewire of analyzejoin.c, which
just aims to get the "unique_inner" bool value down to the join node.

The performance improvement is somewhere around 5-10% for hash joins, and a
bit less for merge join. In theory it could be 50% for nested loop joins.
In my life in the OLTP world, these "unique joins" pretty much cover all
joins that ever happen ever. Perhaps the OLAP world is different, so from
my point of view this is going to be a very nice performance gain.

I'm seeing this patch (or a more complete version) as the first step to a
whole number of other planner improvements:

A couple of examples of those are:

explain select * from sales s inner join product p on p.productid =
s.productid order by s.productid,;

The current plan for this looks like:
Sort (cost=149.80..152.80 rows=1200 width=46)
Sort Key: s.productid,
-> Hash Join (cost=37.00..88.42 rows=1200 width=46)
Hash Cond: (s.productid = p.productid)
-> Seq Scan on sales s (cost=0.00..31.40 rows=2140 width=8)
-> Hash (cost=22.00..22.00 rows=1200 width=38)
-> Seq Scan on product p (cost=0.00..22.00 rows=1200

But in reality we could have executed this with a simple merge join using
the PK of product (productid) to provide presorted input. The extra sort
column on is redundant due to productid being unique.

The UNION planning is crying out for help for cases like this. Where it
could provide sorted input on a unique index, providing the union's
targetlist contained all of the unique index's columns, and we also managed
to find an index on the other part of the union on the same set of columns,
then we could perform a Merge Append and a Unique. This would cause a
signification improvement in execution time for these types of queries, as
the current planner does an append/sort/unique, which especially sucks for
plans with a LIMIT clause.

I think this should solve some of the problems that Kyotarosan encountered
in his episode of dancing with indices over here ->
where he was unable to prove that he could trim down sort nodes once all of
the columns of a unique index had been seen in the order by due to not
knowing if joins were going to duplicate the outer rows.

It should also be possible to reuse a join in situations like:
create view product_sales as select p.productid,sum(s.qty) soldqty from
product p inner join sales s group by p.productid;

Where the consumer does:

select ps.*,p.current_price from product_sales ps inner join product p on
ps.productid = p.productid;

Here we'd currently join the product table twice, both times on the same
condition, but we could safely not bother doing that if we knew that the
join could never match more than 1 row on the inner side. Unfortunately I
deal with horrid situations like this daily, where people have used views
from within views, and all the same tables end up getting joined multiple
times :-(

Of course, both 1 and 2 should be considered separately from the attached
patch, this was just to show where it might lead.

Performance of the patch are as follows:

Test case:
create table t1 (id int primary key);
create table t2 (id int primary key);

insert into t1 select x.x from generate_series(1,1000000) x(x);
insert into t2 select x.x from generate_series(1,1000000) x(x);

vacuum analyze;

select count( from t1 left outer join t2 on;

Values are in transactions per second.

JOIN type Patched Unpatched new runtime
hash 1.295764 1.226188 94.63%
merge 1.786248 1.776605 99.46%
nestloop 0.465356 0.443015 95.20%

Things improve a bit more when using a varchar instead of an int:

hash 1.198821 1.102183 91.94%
I've attached the full benchmark results so as not to spam the thread.

Comments are welcome


David Rowley

Attachment Content-Type Size
unijoin_benchmark.txt text/plain 4.8 KB
unijoins_v1.patch application/octet-stream 22.9 KB


Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2014-12-31 14:00:50 Re: Using 128-bit integers for sum, avg and statistics aggregates
Previous Message Magnus Hagander 2014-12-31 13:27:46 Re: Additional role attributes && superuser review