Number of buckets in a hash join

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Number of buckets in a hash join
Date: 2013-01-28 10:47:58
Message-ID: 5106575E.4010103@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

While testing Alexander's gistchoose patch, "perf report" showed that
the test case spent a surprisingly large amount of CPU time in
ExecScanHashBucket. That function just scans a hash bucket for matches,
and it should be very quick as long as there are not many collisions.

It turns out that the planner estimated the number of rows in the hash
to be much smaller than it actually contained, and the hash table was
initialized with too few buckets as a result. The target is that each
bucket contains 10 tuples (NTUP_PER_BUCKET), but in this case, the
average was about 100.

The first question is, why do we aim at 10 tuples per bucket? My gut
feeling is that that's way too high. I would expect the target to be 1
tuple per bucket, or maybe a little more, like 2-3. Each bucket consumes
one pointer's worth of RAM, which is not much. There's also some
overhead from empty buckets when scanning the hash table, but as long as
all the buckets have at least one entry, there's no gain from having
more than one entry per bucket.

However, lowering NTUP_PER_BUCKET would not have helped in this case,
because we also have a minimum of 1024 buckets. The estimate was so bad
that even after setting NTUP_PER_BUCKET to 1, it was still pegged at
that minimum of 1024 buckets.

Ideally, the planner would always make a good guess the number of rows,
but for the situations that it doesn't, it would be good if the hash
table was enlarged if it becomes too full. It's a well-known technique
to double the size of a hash table once the load factor reaches a
certain threshold, and rehash the existing entries. Another idea is to
just collect all the entries in e.g a linked list when tuples are
inserted to the hash table, and create the buckets lazily, after all the
tuples have been inserted.

Here's an extreme example of this phenomenon. According to perf, about
95% of the CPU time is spent in ExecScanHashBucket. That would be
eliminated by sizing the hash table correctly:

create table numbers (id int4);
insert into numbers select g from generate_series(1, 10000000) g;

explain analyze select * from numbers a, generate_series(1, 100000) b
where b = a.id;
QUERY
PLAN

---------------------------------------------------------------------------------
------------------------------------------------------
Hash Join (cost=22.50..2035430.50 rows=53097600 width=8) (actual
time=32.307..2
9141.348 rows=100000 loops=1)
Hash Cond: (a.id = b.b)
-> Seq Scan on numbers a (cost=0.00..150443.20 rows=10619520
width=4) (actua
l time=0.017..716.503 rows=10000000 loops=1)
-> Hash (cost=10.00..10.00 rows=1000 width=4) (actual
time=32.268..32.268 ro
ws=100000 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 3516kB
-> Function Scan on generate_series b (cost=0.00..10.00
rows=1000 widt
h=4) (actual time=17.966..22.519 rows=100000 loops=1)
Total runtime: 29146.011 ms
(7 rows)

- Heikki

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeevan Chalke 2013-01-28 11:14:32 Re: pg_dump --pretty-print-views
Previous Message Andres Freund 2013-01-28 10:44:36 Re: logical changeset generation v4 - Heikki's thoughts about the patch state