signal 11 segfaults with parallel workers

From: Rick Otten <rottenwindfish(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: signal 11 segfaults with parallel workers
Date: 2017-07-26 14:47:26
Message-ID: CAMAYy4J0AkJacQjcPYE4o+k-Sx=4vA36eEQj3e-h+cpNe+LAzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Starting a couple of weeks ago, our PostgreSQL database has been crashing,
almost daily, with a signal 11 seg fault on a query as the triggering event:

2017-07-11 23:00:29.984 UTC LOG: worker process: parallel worker for
PID 1055 (PID 12405) was terminated by signal 11: Segmentation fault
2017-07-12 23:01:56.432 UTC LOG: worker process: parallel worker for
PID 5752 (PID 32552) was terminated by signal 11: Segmentation fault
2017-07-14 23:00:46.856 UTC LOG: worker process: parallel worker for
PID 24280 (PID 9639) was terminated by signal 11: Segmentation fault
2017-07-15 23:01:24.317 UTC LOG: worker process: parallel worker for
PID 1561 (PID 15153) was terminated by signal 11: Segmentation fault
2017-07-16 23:00:26.722 UTC LOG: worker process: parallel worker for
PID 5776 (PID 7912) was terminated by signal 11: Segmentation fault
2017-07-17 18:58:14.155 UTC LOG: worker process: parallel worker for
PID 11427 (PID 9998) was terminated by signal 11: Segmentation fault
2017-07-17 19:08:04.103 UTC LOG: worker process: parallel worker for
PID 10190 (PID 11907) was terminated by signal 11: Segmentation fault
2017-07-18 23:01:09.775 UTC LOG: worker process: parallel worker for
PID 29445 (PID 360) was terminated by signal 11: Segmentation fault
2017-07-19 18:46:58.676 UTC LOG: worker process: parallel worker for
PID 7080 (PID 27710) was terminated by signal 11: Segmentation fault
2017-07-20 23:00:35.270 UTC LOG: worker process: parallel worker for
PID 19153 (PID 21218) was terminated by signal 11: Segmentation fault
2017-07-21 23:00:41.085 UTC LOG: worker process: parallel worker for
PID 19161 (PID 30720) was terminated by signal 11: Segmentation fault
2017-07-22 23:00:22.169 UTC LOG: worker process: parallel worker for
PID 4903 (PID 6931) was terminated by signal 11: Segmentation fault
2017-07-25 23:02:03.688 UTC LOG: worker process: parallel worker for
PID 11099 (PID 11280) was terminated by signal 11: Segmentation fault

As near as I can tell there were no specific changes preceding this
pattern which might be a root cause. Since then I've tried patching the
Linux instance and bounced the database server, and bumped up the number of
connections (because we were running low sometimes). None of those changes
impacted the regular crashing pattern.

On Sunday (2017-07-23) I set DEBUG5 on all log events, and set it to also
log all queries, so I could try to learn more about what was happening. I
found the culprit query last night, from one of our daily jobs. It was
doing a 5 thread parallel sequence scan on a moderately sized table (maybe
70 columns, by 2.5M rows).

I was not able to force the database to crash by running this query by
hand. I tried a number of times. Although it did happen to someone else
on the 17th.

For now, I've put an index on the relevant columns to avoid the parallel
sequence scan for that query. I also repacked the table. Hopefully we
won't crash tonight too.

There wasn't much extra in the logs to share about the crash. This is from
when it crashed:

2017-07-25 23:02:03.688 UTC DEBUG: reaping dead processes
2017-07-25 23:02:03.688 UTC LOG: worker process: parallel worker for
PID 11099 (PID 11280) was terminated by signal 11: Segmentation fault

And this is when it spun out those parallel workers, just prior to the
segfault, that let me identify the query in question:

2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: registering background worker
"parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"
2017-07-25 23:02:01.805 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"
2017-07-25 23:02:01.805 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"
2017-07-25 23:02:01.806 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"
2017-07-25 23:02:01.806 UTC DEBUG: starting background worker process
"parallel worker for PID 11099"

Here is what /var/log/kern.log had to say about the one from last night:

Jul 25 23:02:01 core-gce kernel: [738031.417934] postgres[11279]: segfault
at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in
postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417953] postgres[11278]: segfault
at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in
postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417967] postgres[11280]: segfault
at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in
postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417989] postgres[11276]: segfault
at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in
postgres[55dc249cd000+64c000]

I'm running on Ubuntu 16.04.02 in Google Compute Environment, on a 16 core
VM with 104G RAM, using the Ubuntu Postgresql 9.6.3 package.

$ pg_config --configure
'--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl'
'--with-libxml' '--with-libxslt'
'--with-tclconfig=/usr/lib/x86_64-linux-gnu/tcl8.6'
'--with-includes=/usr/include/tcl8.6' 'PYTHON=/usr/bin/python'
'--mandir=/usr/share/postgresql/9.6/man'
'--docdir=/usr/share/doc/postgresql-doc-9.6'
'--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/'
'--datadir=/usr/share/postgresql/9.6'
'--bindir=/usr/lib/postgresql/9.6/bin'
'--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/'
'--includedir=/usr/include/postgresql/' '--enable-nls'
'--enable-integer-datetimes' '--enable-thread-safety' '--enable-tap-tests'
'--enable-debug' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld'
'--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo'
'--with-systemd' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat
-Werror=format-security -I/usr/include/mit-krb5 -fPIC -pie
-fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro
-Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5
-L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi'
'--with-ldap' '--with-selinux' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'

$ pg_config --ldflags
-L../../src/common -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now
-Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5
-Wl,--as-needed

$ pg_config --cflags
-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Wendif-labels -Wmissing-format-attribute -Wformat-security
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O2
-fstack-protector-strong -Wformat -Werror=format-security
-I/usr/include/mit-krb5 -fPIC -pie -fno-omit-frame-pointer

I have these two most relevant settings enabled in my configuration:
max_worker_processes = 16
max_parallel_workers_per_gather = 16

If you need anything else, please let me know. I wish I could reproduce
the error every time I ran the query, but it doesn't seem to work that way,
and of course now the query plan is completely different, but I'm sure I
can run other queries that would induce parallel sequence scans on my
tables.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2017-07-26 14:58:56 Re: signal 11 segfaults with parallel workers
Previous Message Feike Steenbergen 2017-07-26 08:58:59 Re: BUG #14746: sub query's plan not right?