Re: dsa_allocate() faliure

From: Jakub Glapa <jakub(dot)glapa(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: alvherre(at)2ndquadrant(dot)com, pryzby(at)telsasoft(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: dsa_allocate() faliure
Date: 2018-11-30 19:20:49
Message-ID: CAJk1zg2kgnAbWAuM3oG20EN_Fvin2Z5OtWJHTR711S2jbNwQQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Hi, just a small update.
I've configured the OS for taking crash dumps on Ubuntu 16.04 with the
following (maybe somebody will find it helpful):
I've added LimitCORE=infinity to /lib/systemd/system/postgresql(at)(dot)service
under [Service] section
I've reloaded the service config with sudo systemctl daemon-reload
Changed the core pattern to: sudo echo
/var/lib/postgresql/core.%p.sig%s.%ts | tee -a /proc/sys/kernel/core_pattern
I had tested it with kill -ABRT pidofbackend and it behaved correctly. A
crash dump was written.

In the last days I've been monitoring no segfault occurred but the
das_allocation did.
I'm starting to doubt if the segfault I've found in dmesg was actually
related.

I've grepped the postgres log for dsa_allocated:
Why do the messages occur sometimes as FATAL and sometimes as ERROR?

2018-11-29 07:59:06 CET::@:[20584]: FATAL: dsa_allocate could not find 7
free pages
2018-11-29 07:59:06 CET:127.0.0.1(40846):user(at)db:[19507]: ERROR:
dsa_allocate could not find 7 free pages
2018-11-30 09:04:13 CET::@:[27341]: FATAL: dsa_allocate could not find 13
free pages
2018-11-30 09:04:13 CET:127.0.0.1(41782):user(at)db:[25417]: ERROR:
dsa_allocate could not find 13 free pages
2018-11-30 09:28:38 CET::@:[30215]: FATAL: dsa_allocate could not find 4
free pages
2018-11-30 09:28:38 CET:127.0.0.1(45980):user(at)db:[29924]: ERROR:
dsa_allocate could not find 4 free pages
2018-11-30 16:37:16 CET::@:[14385]: FATAL: dsa_allocate could not find 7
free pages
2018-11-30 16:37:16 CET::@:[14375]: FATAL: dsa_allocate could not find 7
free pages
2018-11-30 16:37:16 CET:212.186.105.45(55004):user(at)db:[14386]: FATAL:
dsa_allocate could not find 7 free pages
2018-11-30 16:37:16 CET:212.186.105.45(54964):user(at)db:[14379]: ERROR:
dsa_allocate could not find 7 free pages
2018-11-30 16:37:16 CET:212.186.105.45(54916):user(at)db:[14370]: ERROR:
dsa_allocate could not find 7 free pages
2018-11-30 16:45:11 CET:212.186.105.45(55356):user(at)db:[14555]: FATAL:
dsa_allocate could not find 7 free pages
2018-11-30 16:49:13 CET::@:[15359]: FATAL: dsa_allocate could not find 7
free pages
2018-11-30 16:49:13 CET::@:[15363]: FATAL: dsa_allocate could not find 7
free pages
2018-11-30 16:49:13 CET:212.186.105.45(54964):user(at)db:[14379]: FATAL:
dsa_allocate could not find 7 free pages
2018-11-30 16:49:13 CET:212.186.105.45(54916):user(at)db:[14370]: ERROR:
dsa_allocate could not find 7 free pages
2018-11-30 16:49:13 CET:212.186.105.45(55842):user(at)db:[14815]: ERROR:
dsa_allocate could not find 7 free pages
2018-11-30 16:56:11 CET:212.186.105.45(57076):user(at)db:[15638]: FATAL:
dsa_allocate could not find 7 free pages

There's quite a bit errors from today but I was launching the problematic
query in parallel from 2-3 sessions.
Sometimes it was breaking sometimes not.
Couldn't find any pattern.
The workload on this db is not really constant, rather bursting.

--
regards,
Jakub Glapa

On Tue, Nov 27, 2018 at 9:03 AM Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> On Tue, Nov 27, 2018 at 4:00 PM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > Hmm. I will see if I can come up with a many-partition torture test
> > reproducer for this.
>
> No luck. I suppose one theory that could link both failure modes
> would a buffer overrun, where in the non-shared case it trashes a
> pointer that is later dereferenced, and in the shared case it writes
> past the end of allocated 4KB pages and corrupts the intrusive btree
> that lives in spare pages to track available space.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Dolgov 2018-11-30 19:55:27 Re: Add function to release an allocated SQLDA
Previous Message Andrew Dunstan 2018-11-30 19:18:18 Re: pgsql: Switch pg_verify_checksums back to a blacklist

Browse pgsql-performance by date

  From Date Subject
Next Message Justin Pryzby 2018-11-30 20:46:47 Re: dsa_allocate() faliure
Previous Message Pavel Stehule 2018-11-30 14:54:12 Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0