Really bad blowups with hash outer join and nulls

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Cc: gavin(at)shopwindow(dot)me
Subject: Really bad blowups with hash outer join and nulls
Date: 2015-02-15 09:54:21
Message-ID: 87y4nzlmeq.fsf@news-spur.riddles.org.uk
Lists: pgsql-hackers

This came up today on IRC, though I suspect the general problem has been
seen before:

create table m3 (id uuid, status integer);
create table q3 (id uuid);

insert into m3
  select uuid_generate_v4(), floor(random() * 4)::integer
    from generate_series(1,1000000);

insert into q3
  select id
    from (select id, row_number() over () as rnum from m3) s
   where (rnum % 4) = 0;

insert into q3 select null from generate_series(1,250000);

-- at this point m3 has a million unique rows, while q3 has
-- a quarter of the ids of m3, plus as many nulls

-- note that work_mem is at the default 4MB

explain analyze select status, count(*) from q3 left join m3 on (m3.id=q3.id) group by status;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=51553.62..51553.67 rows=4 width=4) (actual time=259580.168..259580.179 rows=5 loops=1)
   Group Key: m3.status
   ->  Hash Right Join  (cost=12927.31..49596.89 rows=391347 width=4) (actual time=18804.502..258498.897 rows=500000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..16753.10 rows=1038310 width=20) (actual time=0.011..1618.762 rows=1000000 loops=1)
         ->  Hash  (cost=6124.47..6124.47 rows=391347 width=16) (actual time=1742.340..1742.340 rows=500000 loops=1)
               Buckets: 131072 (originally 131072)  Batches: 65536 (originally 8)  Memory Usage: 8837kB
               ->  Seq Scan on q3  (cost=0.00..6124.47 rows=391347 width=16) (actual time=0.033..774.732 rows=500000 loops=1)
 Planning time: 0.178 ms
 Execution time: 259583.185 ms

The memory usage of this query runs to hundreds of megabytes despite the
small work_mem.

On 9.3.5, which the user was running, the problem was much more extreme
(this is with the default 1MB work_mem, and a data set only a quarter of
the size):

explain analyze select status, count(*) from q3 left join m3 on (m3.id=q3.id) group by status;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=15020.11..15022.11 rows=200 width=4) (actual time=3733245.942..3733245.952 rows=5 loops=1)
   ->  Hash Right Join  (cost=3976.50..14395.11 rows=125000 width=4) (actual time=227870.000..3732686.692 rows=125000 loops=1)
         Hash Cond: (m3.id = q3.id)
         ->  Seq Scan on m3  (cost=0.00..4189.59 rows=259659 width=20) (actual time=0.024..807.162 rows=250000 loops=1)
         ->  Hash  (cost=1803.00..1803.00 rows=125000 width=16) (actual time=437.991..437.991 rows=125000 loops=1)
               Buckets: 4096  Batches: 262144 (originally 8)  Memory Usage: 1954kB
               ->  Seq Scan on q3  (cost=0.00..1803.00 rows=125000 width=16) (actual time=0.011..191.208 rows=125000 loops=1)
 Total runtime: 3733269.725 ms
(8 rows)

This query on 9.3.5 allocated over 2.7GB of RAM on my system.

Note the explosion in the number of batches. Tracing execution showed
that when the memory limit was reached while processing the large number
of nulls, the number of batches would be doubled, but only a tiny number
of rows would then qualify to be written out as no longer belonging to
the current batch, so the limit was quickly hit again.

A key part of the problem here may be the fact that nulls hash to
exactly 0, and are therefore always part of the initial in-memory batch
regardless of how much splitting happens. If a large subset of the data
had some other constant value, it would likely have some non-zero bits
in its hash value and therefore stand a good chance of being written out
to a batch file before things get too desperate, avoiding the mass
explosion.
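
To make that concrete, here is a standalone sketch (my own names,
loosely modeled on the bucket/batch split done by
ExecHashGetBucketAndBatch in nodeHash.c, and assuming the 9.4-era
shift-based batch selection) showing that a hash value of 0 stays in
batch 0, the in-memory batch, no matter how many times the batch count
doubles, whereas a hash of ~0 goes to a high-numbered batch that is
spilled to a temp file:

/*
 * Standalone sketch (not PostgreSQL source): the low-order hash bits
 * choose the bucket, the next bits choose the batch.
 */
#include <stdio.h>
#include <stdint.h>

static void
get_bucket_and_batch(uint32_t hashvalue, int log2_nbuckets,
                     uint32_t nbuckets, uint32_t nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
}

int
main(void)
{
    uint32_t nbuckets = 131072;     /* 2^17, as in the 9.4 plan above */
    int      log2_nbuckets = 17;
    uint32_t bucketno, batchno;

    /*
     * A null key hashes to 0, so it lands in batch 0 (the in-memory
     * batch) however many times nbatch is doubled: doubling the batch
     * count never evicts those rows.
     */
    for (uint32_t nbatch = 8; nbatch <= 65536; nbatch *= 2)
    {
        get_bucket_and_batch(0, log2_nbuckets, nbuckets, nbatch,
                             &bucketno, &batchno);
        printf("hash=0x00000000 nbatch=%6u -> batch %u\n",
               (unsigned) nbatch, (unsigned) batchno);
    }

    /*
     * By contrast, a value with nonzero high-order hash bits (here ~0)
     * moves to a high-numbered batch as soon as splitting starts, and
     * high-numbered batches are written out to temp files.
     */
    for (uint32_t nbatch = 8; nbatch <= 65536; nbatch *= 2)
    {
        get_bucket_and_batch(~(uint32_t) 0, log2_nbuckets, nbuckets, nbatch,
                             &bucketno, &batchno);
        printf("hash=0xffffffff nbatch=%6u -> batch %u\n",
               (unsigned) nbatch, (unsigned) batchno);
    }
    return 0;
}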

A quick test suggests that initializing the hash value to ~0 rather than
0 has a curious effect: the number of batches still explodes, but the
performance does not suffer the same way. (I think because almost all
the batches end up empty.) I think this is worth doing even in the
absence of a more general solution; nulls are common enough and
important enough that they shouldn't be the worst-case value if it can
be avoided.
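
As a sketch of what I mean (this is not the actual nodeHash.c code; the
helper names and the stand-in per-datum mixer are invented for
illustration), the combined hash could be seeded with ~0 rather than 0,
so that a row whose keys are all null no longer ends up with hash
value 0:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* stand-in per-datum hash; not PostgreSQL's hash_any() */
static uint32_t
hash_datum(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6bU;
    x ^= x >> 13;
    x *= 0xc2b2ae35U;
    x ^= x >> 16;
    return x;
}

static uint32_t
compute_hashvalue(const uint32_t *keys, const bool *isnull, int nkeys)
{
    uint32_t hashkey = ~(uint32_t) 0;   /* proposed seed: ~0 rather than 0 */

    for (int i = 0; i < nkeys; i++)
    {
        /* rotate left one bit so that key order matters */
        hashkey = (hashkey << 1) | (hashkey >> 31);

        /* null keys still contribute nothing to the hash */
        if (!isnull[i])
            hashkey ^= hash_datum(keys[i]);
    }
    return hashkey;
}

int
main(void)
{
    uint32_t key = 0;
    bool     null = true;

    /*
     * With the ~0 seed, a single all-null key hashes to 0xffffffff,
     * which maps to a spillable batch rather than to batch 0.
     */
    printf("all-null hash = 0x%08x\n", compute_hashvalue(&key, &null, 1));
    return 0;
}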

(Testing this idea revealed that a regression test for rowsecurity is
missing an ORDER BY clause, so its results change depending on hash values.)

--
Andrew (irc:RhodiumToad)
