Re: Why is a hash join preferred when it does not fit in work_mem

From: Dimitrios Apostolou <jimis(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Why is a hash join preferred when it does not fit in work_mem
Date: 2023-01-20 20:01:22
Message-ID: 85365bc8-f35b-be82-2582-1545830f976a@gmx.net
Lists: pgsql-general

Hello again, I am back with new experiments.

First of all, I have a concrete set of steps that replicates the
slowness of the hash join I described. If you have a system with
spinning disks lying around, I would appreciate it if you could verify
the scenario. Can you also replicate it on different kinds of systems?

CREATE TABLE descriptions (description_id serial PRIMARY KEY, description text);
-- ~1.2M rows, each description 512 characters long
INSERT INTO descriptions (description_id, description)
SELECT s, repeat(encode(sha512(s::text::bytea), 'hex'), 4)
FROM generate_series(0,1200300) AS s;
CREATE TABLE descriptions_in_books (description_id integer REFERENCES descriptions(description_id), book_id integer);
-- ~5.2M rows, every description referenced by several books
INSERT INTO descriptions_in_books (description_id, book_id)
SELECT s % 1200300, s
FROM generate_series(0,5200300) AS s;
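
(One extra step I'd suggest before timing anything, so that the planner
starts from fresh statistics; autovacuum would get there eventually anyway:)

VACUUM ANALYZE descriptions;
VACUUM ANALYZE descriptions_in_books;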

SET work_mem TO '1MB';
SET hash_mem_multiplier = 1.0;
SET track_io_timing TO on;

EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;
SET enable_hashjoin TO off;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;
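
(And to restore the defaults after experimenting:)

RESET enable_hashjoin;
RESET work_mem;
RESET hash_mem_multiplier;
RESET track_io_timing;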

The first JOIN query uses a hash join with Batches: 1024 and takes 622s!
For most of that time I can see the disk writing at around 1-2 MB/s, so
I have to assume the writes are not sequential.

The second, identical JOIN uses a merge join that completes in 14s. The
I/O happens at a much higher rate (maybe 10x), so I assume it is mostly
sequential.
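
The temp I/O timings in the plans below support this, if my arithmetic
is right:

  97829 temp pages * 8kB = ~765MB / 371.5s = ~2 MB/s   (hash join writes)
  46189 temp pages * 8kB = ~361MB /   0.8s = ~450 MB/s (merge join writes)

The merge join rate is obviously more than a spinning disk can do, so
those writes must mostly land in the OS page cache and get flushed
sequentially.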

Another observation is that the hash join deteriorates as the length of
the TEXT column grows. In fact, if I fill it with strings only 32
characters long, the hash join is split into only 128 batches, and it
completes almost as fast as the merge join. Could it be that the cost
estimation underestimates the I/O pattern caused by splitting into
batches?
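
For reference, the short-string variant can be reproduced by refilling
the tables above with truncated strings (left() is just my choice of
32-character filler; any short strings should behave the same):

TRUNCATE descriptions_in_books, descriptions;  -- reset the tables above
INSERT INTO descriptions (description_id, description)
SELECT s, left(encode(sha512(s::text::bytea), 'hex'), 32)  -- 32 chars instead of 512
FROM generate_series(0,1200300) AS s;
INSERT INTO descriptions_in_books (description_id, book_id)
SELECT s % 1200300, s
FROM generate_series(0,5200300) AS s;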

Here are the measurements:

 Hash Join  (cost=192450.84..401456.02 rows=5200486 width=524) (actual time=344516.004..621725.562 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
   Inner Unique: true
   Hash Cond: (descriptions_in_books.description_id = descriptions.description_id)
   Buffers: shared hit=15586 read=93161, temp read=97829 written=97829
   I/O Timings: shared/local read=1402.597, temp read=229252.170 write=371508.313
   ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75015.86 rows=5200486 width=8) (actual time=0.068..1819.629 rows=5200301 loops=1)
         Output: descriptions_in_books.book_id, descriptions_in_books.description_id
         Buffers: shared hit=32 read=22979
         I/O Timings: shared/local read=249.910
   ->  Hash  (cost=97739.04..97739.04 rows=1200304 width=520) (actual time=343268.470..343268.471 rows=1200301 loops=1)
         Output: descriptions.description_id, descriptions.description
         Buckets: 2048  Batches: 1024  Memory Usage: 686kB
         Buffers: shared hit=15554 read=70182, temp written=78538
         I/O Timings: shared/local read=1152.687, temp write=338883.205
         ->  Seq Scan on public.descriptions  (cost=0.00..97739.04 rows=1200304 width=520) (actual time=0.028..2278.791 rows=1200301 loops=1)
               Output: descriptions.description_id, descriptions.description
               Buffers: shared hit=15554 read=70182
               I/O Timings: shared/local read=1152.687
 Settings: hash_mem_multiplier = '1', work_mem = '1MB'
 Planning Time: 0.303 ms
 Execution Time: 622495.279 ms
(22 rows)
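
As a sanity check on the batch count (back-of-the-envelope arithmetic on
my part, not the planner's exact algorithm): the build side is ~1.2M
rows of width ~520 bytes, i.e. ~600MB, which at 1MB of work_mem has to
be split into ~600 partitions, rounded up to the next power of two:

SELECT power(2, ceil(log(2, 1200301 * 520 / (1024*1024.0)))) AS min_batches;
-- => 1024, matching "Batches: 1024" above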

SET enable_hashjoin TO off;

 Merge Join  (cost=868411.87..1079330.96 rows=5200301 width=524) (actual time=6091.932..13304.924 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
   Merge Cond: (descriptions.description_id = descriptions_in_books.description_id)
   Buffers: shared hit=67 read=111962 written=1, temp read=45806 written=46189
   I/O Timings: shared/local read=1007.043 write=28.575, temp read=344.937 write=794.483
   ->  Index Scan using descriptions_pkey on public.descriptions  (cost=0.43..116919.99 rows=1200304 width=520) (actual time=0.028..1596.387 rows=1200301 loops=1)
         Output: descriptions.description_id, descriptions.description
         Buffers: shared hit=3 read=89015 written=1
         I/O Timings: shared/local read=834.732 write=28.575
   ->  Materialize  (cost=868408.84..894410.35 rows=5200301 width=8) (actual time=6091.892..9171.796 rows=5200301 loops=1)
         Output: descriptions_in_books.book_id, descriptions_in_books.description_id
         Buffers: shared hit=64 read=22947, temp read=45806 written=46189
         I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
         ->  Sort  (cost=868408.84..881409.60 rows=5200301 width=8) (actual time=6091.885..7392.828 rows=5200301 loops=1)
               Output: descriptions_in_books.book_id, descriptions_in_books.description_id
               Sort Key: descriptions_in_books.description_id
               Sort Method: external merge  Disk: 91632kB
               Buffers: shared hit=64 read=22947, temp read=45806 written=46189
               I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
               ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75014.01 rows=5200301 width=8) (actual time=0.031..1617.520 rows=5200301 loops=1)
                     Output: descriptions_in_books.book_id, descriptions_in_books.description_id
                     Buffers: shared hit=64 read=22947
                     I/O Timings: shared/local read=172.311
 Settings: enable_hashjoin = 'off', hash_mem_multiplier = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=10 read=9
   I/O Timings: shared/local read=90.112
 Planning Time: 90.774 ms
 Execution Time: 13955.462 ms
(29 rows)
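
To put the two plans side by side (numbers copied from the EXPLAIN
output above):

                       hash join     merge join
  temp pages written   97829         46189
  temp write time      371508 ms     794 ms
  temp read time       229252 ms     345 ms
  execution time       622495 ms     13955 ms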

Regards,
Dimitris
