OOM in hash join

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: OOM in hash join
Date: 2023-04-14 10:59:27
Message-ID: 94608e6b-caca-02d2-1bec-9806532c476d@garret.ru
Lists: pgsql-hackers

Hi hackers,

A too-small value of work_mem causes memory overflow in parallel hash join
because too many batches are created.
Here is the plan:

explain SELECT * FROM solixschema.MIG_50GB_APR04_G1_H a
  join solixschema.MIG_50GB_APR04_G2_H b on a.seq_pk = b.seq_pk
  join solixschema.MIG_50GB_APR04_G3_H c on b.seq_pk = c.seq_pk
  join solixschema.MIG_50GB_APR04_G4_H d on c.seq_pk = d.seq_pk
  join solixschema.MIG_50GB_APR04_G5_H e on d.seq_pk = e.seq_pk;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=205209076.76..598109290.40 rows=121319744 width=63084)
   Workers Planned: 8
   ->  Parallel Hash Join  (cost=205208076.76..585976316.00 rows=15164968 width=63084)
         Hash Cond: (b.seq_pk = a.seq_pk)
         ->  Parallel Hash Join  (cost=55621683.59..251148173.17 rows=14936978 width=37851)
               Hash Cond: (b.seq_pk = c.seq_pk)
               ->  Parallel Hash Join  (cost=27797595.68..104604780.40 rows=15346430 width=25234)
                     Hash Cond: (b.seq_pk = d.seq_pk)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g2_h b  (cost=0.00..4021793.90 rows=15783190 width=12617)
                     ->  Parallel Hash  (cost=3911716.30..3911716.30 rows=15346430 width=12617)
                           ->  Parallel Seq Scan on mig_50gb_apr04_g4_h d  (cost=0.00..3911716.30 rows=15346430 width=12617)
               ->  Parallel Hash  (cost=3913841.85..3913841.85 rows=15362085 width=12617)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g3_h c  (cost=0.00..3913841.85 rows=15362085 width=12617)
         ->  Parallel Hash  (cost=102628306.07..102628306.07 rows=15164968 width=25233)
               ->  Parallel Hash Join  (cost=27848049.61..102628306.07 rows=15164968 width=25233)
                     Hash Cond: (a.seq_pk = e.seq_pk)
                     ->  Parallel Seq Scan on mig_50gb_apr04_g1_h a  (cost=0.00..3877018.68 rows=15164968 width=12617)
                     ->  Parallel Hash  (cost=3921510.05..3921510.05 rows=15382205 width=12616)
                           ->  Parallel Seq Scan on mig_50gb_apr04_g5_h e  (cost=0.00..3921510.05 rows=15382205 width=12616)

work_mem is 4MB, and the leader plus two parallel workers consume about 10GB each.
There are 262144 batches:

(gdb) p *hjstate->hj_HashTable
$2 = {nbuckets = 1024, log2_nbuckets = 10, nbuckets_original = 1024,
  nbuckets_optimal = 1024, log2_nbuckets_optimal = 10,
  buckets = {unshared = 0x7fa5d5211000, shared = 0x7fa5d5211000},
  keepNulls = false, skewEnabled = false, skewBucket = 0x0,
  skewBucketLen = 0, nSkewBuckets = 0, skewBucketNums = 0x0,
  nbatch = 262144, curbatch = 86506, nbatch_original = 262144,
  nbatch_outstart = 262144, growEnabled = true,
  totalTuples = 122600000, partialTuples = 61136408, skewTuples = 0,
  innerBatchFile = 0x0, outerBatchFile = 0x0,
  outer_hashfunctions = 0x55ce086a3288, inner_hashfunctions = 0x55ce086a32d8,
  hashStrict = 0x55ce086a3328, collations = 0x55ce086a3340, spaceUsed = 0,
  spaceAllowed = 8388608, spacePeak = 204800, spaceUsedSkew = 0,
  spaceAllowedSkew = 167772, hashCxt = 0x55ce086a3170,
  batchCxt = 0x55ce086a5180, chunks = 0x0, current_chunk = 0x7fa5d5283000,
  area = 0x55ce085b56d8, parallel_state = 0x7fa5ee993520,
  batches = 0x7fa5d3ff8048, current_chunk_shared = 1099512193024}
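
Note that spaceAllowed here is presumably just work_mem scaled by the default
hash_mem_multiplier of 2 (my reading, not verified):

    spaceAllowed = 4 MB * 2 = 8 MB = 8388608 bytes

so each batch is expected to fit within 8 MB, and spacePeak = 204800 suggests the
tuple data actually held in memory stays far below that limit (if these local fields
are even meaningful for parallel hash).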

The biggest memory contexts are:

ExecutorState: 1362623568
   HashTableContext: 102760280
    HashBatchContext: 7968
   HashTableContext: 178257752
    HashBatchContext: 7968
   HashTableContext: 5306745728
    HashBatchContext: 7968

There is still some gap between the size reported by the memory context dump and
the actual size of the backend.
But it seems obvious that, in trying to fit into work_mem, the shared tuplestore
creates so many batches that they consume much more memory than work_mem.
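
For what it is worth, here is a rough back-of-envelope sketch of where the memory
can go. It rests on my assumptions, not verified against the code: each batch has
an inner and an outer shared tuplestore, and each tuplestore accessor keeps one
write chunk of STS_CHUNK_PAGES * BLCKSZ = 4 * 8 kB = 32 kB per participating
process.

#include <stdio.h>

int main(void)
{
    /* Assumed numbers: 262144 batches (from the gdb dump above), two
     * shared tuplestores per batch (inner + outer), and one 32 kB
     * write chunk per tuplestore accessor in each participant. */
    double nbatch = 262144;
    double stores_per_batch = 2;
    double chunk_bytes = 4 * 8192;   /* STS_CHUNK_PAGES * BLCKSZ */

    printf("~%.0f GB of write-chunk buffers per process\n",
           nbatch * stores_per_batch * chunk_bytes / (1024 * 1024 * 1024));
    return 0;
}

That is about 16 GB in the worst case, which is at least the right order of
magnitude for the ~10 GB per process observed above; presumably not every batch's
write chunk is allocated at the same moment.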
