Re: BUG #16331: segfault in checkpointer with full disk

From: Jozef Mlich <jmlich83(at)gmail(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16331: segfault in checkpointer with full disk
Date: 2020-04-01 09:51:16
Message-ID: cb90caff210d67bee6be0752b665bcac862d1e25.camel@gmail.com
Lists: pgsql-bugs

On Wed, 2020-04-01 at 11:04 +0200, Julien Rouhaud wrote:
> Hi,
>
> On Wed, Apr 01, 2020 at 08:51:56AM +0000, PG Bug reporting form wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference: 16331
> > Logged by: Jozef Mlich
> > Email address: jmlich83(at)gmail(dot)com
> > PostgreSQL version: 12.2
> > Operating system: CentOS
> > Description:
> >
> > I can see segfaults on CentOS 7 with postgresql 12.2-2PGDG.rhel7 (from
> > yum.postgresql.org). I am using multiple extensions (cstore,
> > postgres_fdw, pgcrypto, dblink, etc.). The crash seems to be related
> > to the disk running out of space (I am using separate partitions for /
> > and for /var/lib/pgsql). It occurs a few times a day. According to the
> > backtrace it seems to be related to the checkpointer.
> > Replication is not configured.
> >
> >
> > [New LWP 26290]
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > Core was generated by `postgres: checkpointer '.
> > Program terminated with signal 6, Aborted.
> > #0 0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> > 55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> >
> > Thread 1 (Thread 0x7fe462e148c0 (LWP 26290)):
> > #0 0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> >     resultvar = 0
> >     pid = 26290
> >     selftid = 26290
> > #1 0x00007fe4604c28f8 in __GI_abort () at abort.c:90
> >     save_stage = 2
> >     act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0},
> >       sa_mask = {__val = {0, 0, 0, 0, 0, 9268713, 70403103920717, 39808819211026438, 20126216749056, 70394513997832, 9268713, 70403103920719, 17316096998686159616, 20134806683648, 140618848608704, 140618848592800}},
> >       sa_flags = 1615828275, sa_restorer = 0x0}
> >     sigs = {__val = {32, 0 <repeats 15 times>}}
> > #2 0x000000000087840a in errfinish (dummy=<optimized out>) at elog.c:552
> >     edata = 0xd47040 <errordata>
> >     elevel = 22
> >     oldcontext = 0x171a6d0
> >     econtext = 0x0
> >     __func__ = "errfinish"
> > #3 0x0000000000706b24 in CheckPointReplicationOrigin () at origin.c:562
> >     tmppath = 0x9e6fa8 "pg_logical/replorigin_checkpoint.tmp"
> >     path = 0x9e6fd0 "pg_logical/replorigin_checkpoint"
> >     tmpfd = <optimized out>
> >     i = <optimized out>
> >     magic = 307747550
> >     crc = 4294967295
> >     __func__ = "CheckPointReplicationOrigin"
>
> That's not a bug (nor a segfault) but the expected behavior if the
> checkpointer is not able to do its work. As data durability can't be
> guaranteed in such a case, the checkpointer raises a PANIC level
> message, which triggers an abort so that the whole instance does an
> emergency restart cycle.
>
> Do you have monitoring for this filesystem? Do you see spikes in
> disk usage or other strange behavior?

Then it is clear. Thanks for the explanation, and I apologize for the
false bug report.

I probably misunderstood how a segfault is distinguished from an abort.
I need to fix my kernel.core_pattern script.
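
For reference, a minimal sketch of how such a helper could tell the two
cases apart. It assumes the script receives the signal number as an
argument, for example through the %s specifier in kernel.core_pattern.
The function name describe_signal, the FAULT_SIGNALS grouping, and the
label wording are hypothetical and only for illustration; this is not
the actual script.

```python
import signal

# Signals that usually indicate a memory-access fault rather than a
# deliberate abort(). This grouping is an assumption for illustration.
FAULT_SIGNALS = {signal.SIGSEGV, signal.SIGBUS}


def describe_signal(signum: int) -> str:
    """Map a signal number (e.g. from core_pattern's %s) to a short label."""
    try:
        sig = signal.Signals(signum)
    except ValueError:
        # Not a signal number known on this platform.
        return f"unknown signal {signum}"
    if sig == signal.SIGABRT:
        # abort() was called on purpose, e.g. after a PANIC-level error.
        return f"{sig.name}: abort"
    if sig in FAULT_SIGNALS:
        return f"{sig.name}: memory fault"
    return f"{sig.name}: other"
```

On Linux, the "signal 6" in the backtrace above would be labelled as an
abort, while a real segfault (signal 11) would be labelled as a memory
fault.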

Attached is a screenshot from Grafana monitoring with information
about space on /var/lib/pgsql partition.
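
A simple free-space check for the same filesystem could look roughly
like the sketch below. The function names free_fraction and
low_on_space and the 10% default threshold are hypothetical choices,
not taken from the monitoring setup; the path to pass in would be
/var/lib/pgsql on this machine.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def low_on_space(path: str, min_free: float = 0.10) -> bool:
    """True when free space on `path`'s filesystem drops below `min_free`."""
    return free_fraction(path) < min_free
```

Calling low_on_space("/var/lib/pgsql") from a periodic job would give a
warning before the checkpointer fails to write its files.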

--
Jozef Mlich <jmlich83(at)gmail(dot)com>

Attachment Content-Type Size
image/png 26.1 KB
