SIGSEGV in BRIN autosummarize

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: SIGSEGV in BRIN autosummarize
Date: 2017-10-14 03:57:32
Message-ID: 20171014035732.GB31726@telsasoft.com
Lists: pgsql-hackers

I upgraded one of our customers to PG10 Tuesday night, and on Wednesday replaced
a BTREE index with a BRIN index (WITH autosummarize).
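
For readers unfamiliar with the option, an index change of this kind could
look roughly like the sketch below. The table and index names are the ones that
appear later in this message; the column name recordopeningtime is only
inferred from the index name, and the exact statements that were run are not
shown here, so treat this as an assumption rather than the actual commands.

```sql
-- Sketch only: the column name is assumed from the index name.
-- autosummarize is a BRIN storage parameter introduced in PostgreSQL 10.
CREATE INDEX cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx
    ON cdrs_eric_egsnpdprecord_2017_10_13
    USING brin (recordopeningtime)
    WITH (autosummarize = on);

-- The parameter can be switched off again without rebuilding the index,
-- after which new page ranges can be summarized by hand:
ALTER INDEX cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx
    SET (autosummarize = off);
SELECT brin_summarize_new_values(
    'cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx');
```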

Today I see:
< 2017-10-13 17:22:47.839 -04 >LOG: server process (PID 32127) was terminated by signal 11: Segmentation fault
< 2017-10-13 17:22:47.839 -04 >DETAIL: Failed process was running: autovacuum: BRIN summarize public.gtt 747263

postmaster[32127] general protection ip:4bd467 sp:7ffd9b349990 error:0 in postgres[400000+692000]

[pryzbyj(at)database ~]$ rpm -qa postgresql10
postgresql10-10.0-1PGDG.rhel6.x86_64

Oct 13 17:22:45 database kernel: postmaster[32127] general protection ip:4bd467 sp:7ffd9b349990 error:0 in postgres[400000+692000]
Oct 13 17:22:47 database abrtd: Directory 'ccpp-2017-10-13-17:22:47-32127' creation detected
Oct 13 17:22:47 database abrt[32387]: Saved core dump of pid 32127 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-13-17:22:47-32127 (15040512 bytes)

Unfortunately, abrt then discarded the core dump:
Oct 13 17:22:47 database abrtd: Package 'postgresql10-server' isn't signed with proper key
Oct 13 17:22:47 database abrtd: 'post-create' on '/var/spool/abrt/ccpp-2017-10-13-17:22:47-32127' exited with 1
Oct 13 17:22:47 database abrtd: DELETING PROBLEM DIRECTORY '/var/spool/abrt/ccpp-2017-10-13-17:22:47-32127'

postgres=# SELECT * FROM bak_postgres_log_2017_10_13_1700 WHERE pid=32127 ORDER BY log_time DESC LIMIT 9;
-[ RECORD 1 ]----------+---------------------------------------------------------------------------------------------------------
log_time | 2017-10-13 17:22:45.56-04
pid | 32127
session_id | 59e12e67.7d7f
session_line | 2
command_tag |
session_start_time | 2017-10-13 17:21:43-04
error_severity | ERROR
sql_state_code | 57014
message | canceling autovacuum task
context | processing work entry for relation "gtt.public.cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx"
-[ RECORD 2 ]----------+---------------------------------------------------------------------------------------------------------
log_time | 2017-10-13 17:22:44.557-04
pid | 32127
session_id | 59e12e67.7d7f
session_line | 1
session_start_time | 2017-10-13 17:21:43-04
error_severity | ERROR
sql_state_code | 57014
message | canceling autovacuum task
context | automatic analyze of table "gtt.public.cdrs_huawei_sgsnpdprecord_2017_10_13"

Time: 375.552 ms

It looks like this table was being inserted into simultaneously by a python
program using multiprocessing. It looks like each subprocess was INSERTing
into several tables, each of which has one BRIN index on a timestamp column.

gtt=# \dt+ cdrs_eric_egsnpdprecord_2017_10_13
public | cdrs_eric_egsnpdprecord_2017_10_13 | table | gtt | 5841 MB |

gtt=# \di+ cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx
public | cdrs_eric_egsnpdprecord_2017_10_13_recordopeningtime_idx | index | gtt | cdrs_eric_egsnpdprecord_2017_10_13 | 136 kB |

I don't have any reason to believe there's a memory issue on the server, so I
suppose this is just a "heads up" to early adopters until/in case it happens
again and I can at least provide a stack trace.

Justin
