Docker + postgreSQL : OOM killing in a large Group by operation

From: Jorge Daniel <elgaita(at)hotmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Docker + postgreSQL : OOM killing in a large Group by operation
Date: 2018-04-05 17:02:18
Message-ID: SC1P15201MB2461D3920C38C6C0FA8AE0ADBCBB0@SC1P15201MB2461.LAMP152.PROD.OUTLOOK.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Guys:
I have a problem with a query that grabs a bunch of rows and then does an aggreate operation, at that moment it gots killed by OOM-killer, I don't know why, the engine starts using tmpfiles as expected , and then tries to work in memory and gots killed.
I've test it in an small enviroment for more dramatic/quick results

Env:
Running on a 1GB memory Docker-container .

PostgreSQL 9.4.8 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
Database Analyzed and the stats are up-to-date

postgresql.conf:
shared_buffers = 512MB
work_mem = 16MB
effective_cache_size = 256MB

Bad Query:

SELECT count(*)
FROM "changelog_change_transaction"
INNER JOIN "changelog_change_stats" ON ( changelog_change_stats.changelog_change_transactionid = changelog_change_transaction.changelog_change_transactionid )
LEFT OUTER JOIN "changelog_change_group" ON ( changelog_change_transaction.changelog_change_groupid = changelog_change_group.changelog_change_groupid )

WHERE ( changelog_change_group.companyid = 40 OR changelog_change_group.companyid = 1 OR changelog_change_group.companyid = 53 OR changelog_change_group.companyid IS NULL )
AND changelog_change_transaction.started_at > '2017-04-21'
GROUP BY "changelog_change_transaction"."changelog_change_transactionid", "changelog_change_transaction"."epoch", "changelog_change_transaction"
."changelog_change_groupid", "changelog_change_transaction"."started_at", "changelog_change_transaction"."duration_microseconds", "changelog_change_transaction"."changed_items", "changelog_change_transaction"."xmin"
;

Explain :

HashAggregate (cost=7845766.73..8117654.17 rows=27188744 width=152)
Group Key: changelog_change_transaction.changelog_change_transactionid, changelog_change_transaction.epoch, changelog_change_transaction.changelog_change_groupid, changelog_change_transaction.started_at, changelog_change_transaction.duration_microseconds, changelog_change_transaction.changed_items, changelog_change_transaction.xmin
-> Hash Left Join (cost=2498235.67..7301991.85 rows=27188744 width=152)
Hash Cond: (changelog_change_transaction.changelog_change_groupid = changelog_change_group.changelog_change_groupid)
Filter: ((changelog_change_group.companyid = 40) OR (changelog_change_group.companyid = 1) OR (changelog_change_group.companyid = 53) OR (changelog_change_group.companyid IS NULL))
-> Hash Join (cost=2142692.83..5176273.34 rows=27726867 width=152)
Hash Cond: (changelog_change_stats.changelog_change_transactionid = changelog_change_transaction.changelog_change_transactionid)
-> Seq Scan on changelog_change_stats (cost=0.00..689345.48 rows=33612148 width=6)
-> Hash (cost=1215858.45..1215858.45 rows=27272350 width=152)
-> Seq Scan on changelog_change_transaction (cost=0.00..1215858.45 rows=27272350 width=152)
Filter: (started_at > '2017-04-21 00:00:00-07'::timestamp with time zone)
-> Hash (cost=200291.82..200291.82 rows=8931282 width=10)
-> Seq Scan on changelog_change_group (cost=0.00..200291.82 rows=8931282 width=10)

Log execution:

2018-04-05 09:07:12.444 PDT rhost=[local] app=psql:user=postgres:db=telecom:LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp6455.19", size 9437168
......
2018-04-05 09:07:25.605 PDT rhost= app=:user=:db=:LOG: server process (PID 6455) was terminated by signal 9: Killed
2018-04-05 09:07:25.605 PDT rhost= app=:user=:db=:DETAIL: Failed process was running: SELECT count(*)
FROM "changelog_change_transaction"
INNER JOIN "changelog_change_stats" ON ( changelog_change_stats.changelog_change_transactionid = changelog_change_transaction.changelog_change_transactionid )
LEFT OUTER JOIN "changelog_change_group" ON ( changelog_change_transaction.changelog_change_groupid = changelog_change_group.changelog_change_groupid )

WHERE ( changelog_change_group.companyid = 40 OR changelog_change_group.companyid = 1 OR changelog_change_group.companyid = 53 OR changelog_change_group.companyid IS NULL )
AND changelog_change_transaction.started_at > '2017-04-21'
GROUP BY "changelog_change_transaction"."changelog_change_transactionid", "changelog_change_transaction"."epoch", "changelog_change_transaction"
."changelog_change_groupid", "changelog_change_transaction"."started_at", "changelog_change_transaction"."duration_microseconds", "changelog_change_transaction"."changed_items", "changelog_change_transaction"."xmin"
;
2018-04-05 09:07:25.605 PDT rhost= app=:user=:db=:LOG: terminating any other active server processes
2018-04-05 09:07:25.605 PDT rhost= app=:user=:db=:WARNING: terminating connection because of crash of another server process

Monitoring the /proc/*/status of the running client process ,I've noticed this :

VmData: 7944 kB
VmData: 7944 kB
VmData: 7944 kB
VmData: 29788 kB
VmData: 30696 kB
VmData: 31724 kB
...
VmData: 33776 kB
...
VmData: 37876 kB
...
VmData: 46072 kB
...
VmData: 55272 kB
VmData: 67568 kB
VmData: 76032 kB
VmData: 76484 kB
VmData: 78156 kB
...
...
VmData: 80208 kB
...
VmData: 84308 kB
VmData: 92504 kB
VmData: 198972 kB
VmData: 354620 kB
VmData: 495208 kB
...
VmData: 682364 kB
VmData: 788988 kB
VmData: 821756 kB
VmData: 887292 kB
...
VmData: 961020 kB
...
VmData: 993788 kB
...
VmData: 1001980 kB

For sure if the GROUP BY the one that causes this OOM (when I removed it, the query finish ok ) , so I've change the query-plan to avoid the HashAggregate:

telecom=# show enable_hashagg ;
enable_hashagg
----------------
off

But the explain still shows:

HashAggregate (cost=19768044.56..20039932.00 rows=27188744 width=152)
Group Key: changelog_change_transaction.changelog_change_transactionid, changelog_change_transaction.epoch, changelog_change_transaction.changelog_change_groupid, changelog_change_transaction.started_at, changelog_change_transaction.duration_microseconds, changelog_change_transaction.changed_items, changelog_change_transaction.xmin

Any help or light on this will be really appreciated!

Regards

Jorge Daniel Fernandez

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2018-04-05 17:07:12 Re: [PATCH] Logical decoding of TRUNCATE
Previous Message Adam =?utf-8?Q?Sj=C3=B8gren?= 2018-04-05 16:03:23 Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100