Segmentation fault mayhem

From: Sebastian Gröbler <sebastian(dot)groebler(at)quandoo(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Segmentation fault mayhem
Date: 2017-09-25 18:19:45
Message-ID: etPan.59c948c1.6db2c4a0.17309@quandoo.com
Lists: pgsql-bugs

Hey guys,

First off, a big thank you for providing and maintaining this awesome piece of software.

So here comes the problem: we are using postgres 9.6.1 (recently upgraded from 9.2) on Debian Jessie.

Our setup consists of a master with wal_level=logical that streams to a hot standby (used for failover); the hot standby in turn streams to a "read slave" that is used for BI use cases.
Aside from that, we are using the Database Migration Service (DMS) from AWS to stream our data into a new system that is supposed to replace the BI "read slave". DMS has to run against the master and, aside from a user, a function, a trigger and a table, it does not require much more of the source.
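For context, the replication-related settings on the master look roughly like the following; the numbers here are illustrative rather than copied verbatim from our config:

wal_level = logical           # needed for the DMS logical decoding tasks
max_wal_senders = 10          # hot standby plus the DMS connections
max_replication_slots = 10    # logical slots used by DMS
hot_standby = on              # set on the standbys so they can serve reads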

We had been in production with 9.6.1 for a couple of weeks without DMS and without any problems. Then we turned on DMS, and over the last weekend we had several segmentation fault crashes. So we turned DMS off, moved wal_level back to "replica" and enabled core dumps. However, it seemed that DMS was not related to the problem, as the database kept crashing.

This is the log and the core dump backtrace:

2017-09-24 08:26:26 UTC [30797-9] LOG:  server process (PID 32738) was terminated by signal 11: Segmentation fault
2017-09-24 08:26:26 UTC [30797-10] DETAIL:  Failed process was running: select pushnotifi0_.pnc_id as pnc_id1_2_, pushnotifi0_.pnc_c_id as pnc_c_id2_2_, pushnotifi0_.pnc_c_public_id as pnc_c_pu3_2_, pushnotifi0_.pnc_device_owner as pnc_devi4_2_, pushnotifi0_.pnc_error as pnc_erro5_2_, pushnotifi0_.pnc_expiration_date as pnc_expi6_2_, pushnotifi0_.pnc_message as pnc_mess7_2_, pushnotifi0_.pnc_mdi_id as pnc_mdi_8_2_, pushnotifi0_.pnc_provider as pnc_prov9_2_, pushnotifi0_.pnc_reference_id as pnc_ref10_2_, pushnotifi0_.pnc_retry as pnc_ret11_2_, pushnotifi0_.pnc_status as pnc_sta12_2_, pushnotifi0_.pnc_target_id as pnc_tar13_2_, pushnotifi0_.pnc_target_public_id as pnc_tar14_2_, pushnotifi0_.pnc_type as pnc_typ15_2_ from crm.push_notifications_cron pushnotifi0_ where pushnotifi0_.pnc_type=$1 and pushnotifi0_.pnc_reference_id=$2 and pushnotifi0_.pnc_mdi_id=$3

Core was generated by `postgres: 9.6/main: postgres booking x.x.x.x(xxxx) SELECT               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fe0c2411721 in EOH_get_flat_size ()
(gdb) bt
#0  0x00007fe0c2411721 in EOH_get_flat_size ()
#1  0x00007fe0c240e178 in datumEstimateSpace ()
#2  0x00007fe0c231ea9d in EstimateParamListSpace ()
#3  0x00007fe0c22b933d in ExecInitParallelPlan ()
#4  0x00007fe0c22cb316 in ExecGather ()
#5  0x00007fe0c22b9b08 in ExecProcNode ()
#6  0x00007fe0c22b5d26 in standard_ExecutorRun ()
#7  0x00007fe0c23d6527 in ?? ()
#8  0x00007fe0c23d7b38 in PortalRun ()
#9  0x00007fe0c23d4843 in PostgresMain ()
#10 0x00007fe0c21500a8 in ?? ()
#11 0x00007fe0c23722b0 in PostmasterMain ()
#12 0x00007fe0c2151267 in main ()

The selected table has about a million records and there is no index on the fields in the WHERE clause.
pnc_type is a custom type, pnc_reference_id is a varchar(50) and pnc_mdi_id is a uuid.
Running the query manually, we were not able to reproduce the problem, but we saw that postgres would use the parallel query feature.
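For reference, this is roughly how we looked at the plan; the literal values below are only placeholders for the application's $1/$2/$3 bind parameters:

EXPLAIN
SELECT *
  FROM crm.push_notifications_cron
 WHERE pnc_type = 'SOME_TYPE'
   AND pnc_reference_id = 'some-reference'
   AND pnc_mdi_id = '00000000-0000-0000-0000-000000000000'::uuid;

A Gather node at the top of the plan, driving a parallel sequential scan, is what indicates the parallel query feature is in use, and it matches the ExecGather frame in the backtrace above.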

So next we disabled parallel queries by setting max_parallel_workers_per_gather=0 and restarted the server, but we also set wal_level back to logical.
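Concretely, the change before the restart amounted to something along these lines in postgresql.conf (paraphrased, not a verbatim copy):

max_parallel_workers_per_gather = 0   # disable parallel query
wal_level = logical                   # back to logical so DMS can decode again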
Since we had a crash with wal_level=replica and without DMS, we ruled out DMS and wanted to turn it back on.
The moment we did, or actually a few seconds later, the database crashed again.

Here is the log and the core dump backtrace:

2017-09-25 10:27:24 UTC [101978-1] dms(at)booking LOG:  starting logical decoding for slot "uhbeasbguomldspa_00016400_846fa332_0d73_4e40_bef4_da6c7a70c868"
2017-09-25 10:27:24 UTC [101978-2] dms(at)booking DETAIL:  streaming transactions committing after B96/1FD4BB78, reading WAL from B96/1FD241C8
2017-09-25 10:27:24 UTC [101978-3] dms(at)booking LOG:  logical decoding found consistent point at B96/1FD241C8
2017-09-25 10:27:24 UTC [101978-4] dms(at)booking DETAIL:  There are no running transactions.
2017-09-25 10:27:25 UTC [94708-8] LOG:  server process (PID 101978) was terminated by signal 11: Segmentation fault
2017-09-25 10:27:25 UTC [94708-9] LOG:  terminating any other active server processes

Core was generated by `postgres: 9.6/main: wal sender process dms x.x.x.x(xxxx) idle           '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd3190ccad6 in ReorderBufferCommit ()
(gdb) bt
#0  0x00007fd3190ccad6 in ReorderBufferCommit ()
#1  0x00007fd3190c5e00 in LogicalDecodingProcessRecord ()
#2  0x00007fd3190d08a9 in ?? ()
#3  0x00007fd3190d2433 in ?? ()
#4  0x00007fd3190d3282 in exec_replication_command ()
#5  0x00007fd319116b35 in PostgresMain ()
#6  0x00007fd318e930a8 in ?? ()
#7  0x00007fd3190b52b0 in PostmasterMain ()
#8  0x00007fd318e94267 in main ()

It may be important to note that we have two databases in the cluster and that DMS was set up with one task for each, so there were two logical replication streams running simultaneously.
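For completeness, the slots created for the two DMS tasks can be checked with a standard catalog query like this (nothing DMS-specific):

SELECT slot_name, plugin, slot_type, database, active, restart_lsn
  FROM pg_replication_slots;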

After that last failure we wanted to rule out hardware issues and failed over to our hot standby, which is currently running with wal_level=replica and thus, so far, without DMS.

Soon we would like to switch back to wal_level=logical and enable DMS again to confirm whether it was a hardware issue.
Should the issue persist, we will probably upgrade to 9.6.5.

We skimmed through the release notes of 9.6.2, 9.6.3, 9.6.4 and 9.6.5, but couldn't find anything that obviously relates to our problem.

Do you guys have any ideas or tips on what the cause may be and what else to do?

Thanks for your help!
Sebastian
