Unexpectedly huge memory usage (over 180x of work_mem) during hash join... confirmed by TopMemoryContext results (postgresql 13.10)

From: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Unexpectedly huge memory usage (over 180x of work_mem) during hash join... confirmed by TopMemoryContext results (postgresql 13.10)
Date: 2023-04-21 09:13:32
Message-ID: CAK-MWwQFXw5KgvdYtWLCBsTGaeAn4Xx8_wGs8XuM0KLxLoVzew@mail.gmail.com
Lists: pgsql-general

Hi,

While investigating an OOM crash of a server with more than 30GB of available
memory, I found that it was triggered by a single query.

related settings:
work_mem=32MB
hash_mem_multiplier=1
max_parallel_workers_per_gather=3
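
For anyone who wants to experiment, these can be reproduced at the session
level; just a sketch, assuming everything else is left at the server defaults:

SET work_mem = '32MB';
SET hash_mem_multiplier = 1;
SET max_parallel_workers_per_gather = 3;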

The investigation below was performed on an idle server with no other
activity on it.

Now to the results:

Query and explain (analyze, costs, buffers, timing) results:

explain (analyze, costs, buffers, timing)
select * from employer_manager
WHERE employer_id IN (
    SELECT employer_id
    FROM employer e
    WHERE EXISTS (
        SELECT 1
        FROM employer_service es
        JOIN account_service_package_entry aspe
            ON (aspe.service_package_id = es.employer_service_id)
        JOIN account_service_event ase
            ON (ase.account_service_package_entry_id = aspe.id)
        JOIN vacancy v
            ON (ase.object_id = v.vacancy_id)
        WHERE v.employer_id = e.employer_id AND es.price = 0
    )
);

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=12483578.55..12699754.67 rows=4951913 width=183) (actual time=402888.266..541201.083 rows=3456146 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   Buffers: shared hit=10227924, temp read=5404033 written=9746488
   ->  Parallel Hash Semi Join  (cost=12483478.55..12650135.54 rows=1597391 width=183) (actual time=402879.754..403927.356 rows=864036 loops=4)
         Hash Cond: (employer_manager.employer_id = v.employer_id)
         Buffers: shared hit=10227924, temp read=5404033 written=9746488
         ->  Parallel Seq Scan on employer_manager  (cost=0.00..61704.37 rows=3902987 width=183) (actual time=0.015..926.040 rows=3030730 loops=4)
               Buffers: shared hit=226745
         ->  Parallel Hash  (cost=12470696.28..12470696.28 rows=1173692 width=8) (actual time=400745.411..400757.615 rows=315704 loops=4)
               Buckets: 1048576  Batches: 8  Memory Usage: 14432kB
               Buffers: shared hit=10000933, temp read=5185403 written=9531604
               ->  Parallel Hash Semi Join  (cost=5419572.63..12470696.28 rows=1173692 width=8) (actual time=381755.186..400673.065 rows=315704 loops=4)
                     Hash Cond: (e.employer_id = v.employer_id)
                     Buffers: shared hit=10000933, temp read=5185403 written=9527768
                     ->  Parallel Seq Scan on employer e  (cost=0.00..65876.37 rows=2986447 width=4) (actual time=0.015..547.631 rows=2313235 loops=4)
                           Buffers: shared hit=360119
                     ->  Parallel Hash  (cost=4389870.76..4389870.76 rows=94549378 width=4) (actual time=300367.731..300370.699 rows=10515574 loops=4)
                           Buckets: 1048576 (originally 1048576)  Batches: 131072 (originally 512)  Memory Usage: 287136kB
                           Buffers: shared hit=9640814, temp read=4068338 written=7441540
                           ->  Parallel Hash Join  (cost=1522656.32..4389870.76 rows=94549378 width=4) (actual time=113906.860..121650.966 rows=10515574 loops=4)
                                 Hash Cond: (aspe.service_package_id = es.employer_service_id)
                                 Buffers: shared hit=9640814, temp read=3570214 written=3573964
                                 ->  Parallel Hash Join  (cost=1139337.57..3828356.61 rows=132999548 width=8) (actual time=87792.230..102392.631 rows=60107033 loops=4)
                                       Hash Cond: (ase.account_service_package_entry_id = aspe.id)
                                       Buffers: shared hit=8740464, temp read=2645196 written=2648224
                                       ->  Parallel Hash Join  (cost=669912.02..3179027.68 rows=132999548 width=8) (actual time=44539.790..74472.198 rows=60107033 loops=4)
                                             Hash Cond: (ase.object_id = v.vacancy_id)
                                             Buffers: shared hit=7347306, temp read=1656073 written=1657576
                                             ->  Parallel Seq Scan on account_service_event ase  (cost=0.00..2275263.13 rows=172006123 width=8) (actual time=0.011..20798.324 rows=133306300 loops=4)
                                                   Buffers: shared hit=5552019
                                             ->  Parallel Hash  (cost=414267.15..414267.15 rows=23473845 width=8) (actual time=7923.194..7923.195 rows=18198520 loops=4)
                                                   Buckets: 1048576  Batches: 128  Memory Usage: 30528kB
                                                   Buffers: shared hit=1795287, temp written=247932
                                                   ->  Parallel Seq Scan on vacancy v  (cost=0.00..414267.15 rows=23473845 width=8) (actual time=0.011..5221.328 rows=18198520 loops=4)
                                                         Buffers: shared hit=1795287
                                       ->  Parallel Hash  (cost=297333.92..297333.92 rows=15801812 width=8) (actual time=4352.140..4352.141 rows=12280833 loops=4)
                                             Buckets: 1048576  Batches: 128  Memory Usage: 23296kB
                                             Buffers: shared hit=1393158, temp written=167644
                                             ->  Parallel Seq Scan on account_service_package_entry aspe  (cost=0.00..297333.92 rows=15801812 width=8) (actual time=0.015..2520.045 rows=12280833 loops=4)
                                                   Buffers: shared hit=1393158
                                 ->  Parallel Hash  (cost=258846.24..258846.24 rows=11429325 width=4) (actual time=3424.467..3424.468 rows=8908616 loops=4)
                                       Buckets: 1048576  Batches: 64  Memory Usage: 30080kB
                                       Buffers: shared hit=900350, temp written=103252
                                       ->  Parallel Seq Scan on employer_service es  (cost=0.00..258846.24 rows=11429325 width=4) (actual time=0.010..2087.361 rows=8908616 loops=4)
                                             Filter: (price = 0)
                                             Rows Removed by Filter: 3624132
                                             Buffers: shared hit=900350
 Planning:
   Buffers: shared hit=236
 Planning Time: 25.494 ms
 Execution Time: 541359.085 ms

There are no big row-count misestimations, but during the moments of peak
memory usage each parallel worker used over 5GB of RAM (so over 20GB of RAM
in total for the 3 workers plus the leader).

Memory context results of a single worker:

TopMemoryContext: 3357032 total in 16 blocks; 46064 free (16 chunks); 3310968 used
...
  TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
    PortalHoldContext: 24632 total in 2 blocks; 7384 free (0 chunks); 17248 used
    PortalContext: 540672 total in 69 blocks; 8568 free (10 chunks); 532104 used:
      ExecutorState: 1082567568 total in 124072 blocks; 4073608 free (7018 chunks); 1078493960 used
        HashTableContext: 74808 total in 5 blocks; 25008 free (3 chunks); 49800 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        HashTableContext: 74808 total in 5 blocks; 25056 free (3 chunks); 49752 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        HashTableContext: 40960 total in 3 blocks; 12328 free (6 chunks); 28632 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        HashTableContext: 4156883208 total in 124074 blocks; 146264 free (30 chunks); 4156736944 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        HashTableContext: 8192 total in 1 blocks; 4376 free (7 chunks); 3816 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        ...
        ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  Relcache by OID: 16384 total in 2 blocks; 1424 free (2 chunks); 14960 used
  CacheMemoryContext: 1163704 total in 14 blocks; 197832 free (0 chunks); 965872 used
    relation rules: 24576 total in 6 blocks; 4896 free (1 chunks); 19680 used: pg_settings
    index info: 2048 total in 2 blocks; 528 free (1 chunks); 1520 used: pg_toast_2619_index
    ....
    48 more child contexts containing 91136 total in 83 blocks; 27632 free (45 chunks); 63504 used
  PrivateRefCount: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used
  MdSmgr: 8192 total in 1 blocks; 3880 free (2 chunks); 4312 used
  LOCALLOCK hash: 32768 total in 3 blocks; 16824 free (8 chunks); 15944 used
  Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used
  ErrorContext: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used
Grand total: 5245755200 bytes in 248596 blocks; 4995280 free (7224 chunks); 5240759920 used
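
(For reference: on 13.10 there is no SQL-level way to request such a dump, but
from PostgreSQL 14 on an equivalent dump can be written to the server log on
demand; a sketch, assuming a superuser session and that the parallel workers
are still running:

SELECT pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
)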

Also, during query execution there was a huge number of small temp files:
find ~postgres/13/main/base/pgsql_tmp/ -type f | wc -l
1459383
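
The same numbers can be pulled from inside the database; a sketch, assuming
superuser or pg_monitor access, and it only covers the temp directory of the
default tablespace:

SELECT count(*) AS files, pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_tmpdir();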

There are two questions:
1) Is this behavior normal/expected?
2) Is there any way to limit the memory usage of such queries, to prevent OOM
at random times?
(Memory usage of almost 200x work_mem surprised me a lot.)
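
To make question 2 concrete, by "limit memory usage" I mean something along
the lines of session-level settings like the following; this is only a sketch,
and I have not verified that it keeps this particular query within bounds:

SET max_parallel_workers_per_gather = 0;  -- no parallel workers, one copy of each hash table
SET enable_parallel_hash = off;           -- avoid shared parallel hash tables entirely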

PS: Crash recovery of a database with a few million files in pgsql_tmp takes
ages (however, this is fixed in newer versions, so it's not a critical issue).

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
