Re: SIGSEGV in BRIN autosummarize

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: SIGSEGV in BRIN autosummarize
Date: 2017-10-17 06:33:11
Message-ID: 20171017063311.GB13880@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Oct 15, 2017 at 02:44:58PM +0200, Tomas Vondra wrote:
> Thanks, but I'm not sure that'll help, at this point. We already know
> what happened (corrupted memory), we don't know "how". And core files
> are mostly just "snapshots" so are not very useful in answering that :-(

Is there anything I should be saving for these or hints how else to debug? I
just got a segfault while looking for logs from an segfault from 2hr
earlier...but it was a duplicate from the first stacktrace...

< 2017-10-16 23:21:23.172 -04 >LOG: server process (PID 31543) was terminated by signal 11: Segmentation fault
< 2017-10-16 23:21:23.172 -04 >DETAIL: Failed process was running: autovacuum: BRIN summarize public.gtt 162175

Oct 16 23:21:22 database kernel: postmaster[31543] general protection ip:4bd467 sp:7ffe08a94890 error:0 in postgres[400000+692000]
...
Oct 17 01:58:36 database kernel: postmaster[8646]: segfault at 8 ip 000000000084a177 sp 00007ffe08a94a88 error 4 in postgres[400000+692000]
Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 (7692288 bytes)
Oct 17 01:58:38 database abrtd: Directory 'ccpp-2017-10-17-01:58:38-8646' creation detected

Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 (7692288 bytes)
Oct 17 01:58:38 database abrtd: Directory 'ccpp-2017-10-17-01:58:38-8646' creation detected
Oct 17 01:59:18 database abrtd: Generating core_backtrace
Oct 17 01:59:18 database abrtd: Duplicate: core backtrace
Oct 17 01:59:18 database abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2017-10-14-18:05:35-26500
Oct 17 01:59:18 database abrtd: Deleting problem directory ccpp-2017-10-17-01:58:38-8646 (dup of ccpp-2017-10-14-18:05:35-26500)

postgres=# SELECT * FROM postgres_log_2017_10_16_2300 WHERE pid=31543;
-[ RECORD 1 ]----------+-------------------------------------------------------------------------------------------------
log_time | 2017-10-16 23:21:22.021-04
pid | 31543
session_id | 59e57677.7b37
session_line | 2
session_start_time | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54376
transaction_id | 0
error_severity | ERROR
sql_state_code | 57014
message | canceling autovacuum task
context | automatic analyze of table "gtt.public.cdrs_eric_egsnpdprecord_2017_10_15"
-[ RECORD 2 ]----------+-------------------------------------------------------------------------------------------------
log_time | 2017-10-16 23:21:22.021-04
pid | 31543
session_id | 59e57677.7b37
session_line | 3
session_start_time | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54377
transaction_id | 0
error_severity | ERROR
sql_state_code | 57014
message | canceling autovacuum task
context | processing work entry for relation "gtt.public.cdrs_eric_msc_voice_2017_10_16_startofcharge_idx"
-[ RECORD 3 ]----------+-------------------------------------------------------------------------------------------------
log_time | 2017-10-16 23:21:21.003-04
pid | 31543
session_id | 59e57677.7b37
session_line | 1
session_start_time | 2017-10-16 23:18:15-04
virtual_transaction_id | 22/54374
transaction_id | 0
error_severity | ERROR
sql_state_code | 57014
message | canceling autovacuum task
context | automatic analyze of table "gtt.public.cdrs_eric_ggsnpdprecord_2017_10_15"

Note, the most recent LOCK TABLE I could find beofre 23:21:22 was actually
~10min earlier at 2017-10-16 23:12:16.519-04.

Core was generated by `postgres: autovacuum worker process gtt '.
Program terminated with signal 11, Segmentation fault.
#0 index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
178 LockRelId relid = relation->rd_lockInfo.lockRelId;
Missing separate debuginfos, use: debuginfo-install audit-libs-2.4.5-6.el6.x86_64 cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libgcc-4.4.7-18.el6.x86_64 libicu-4.2.1-14.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libstdc++-4.4.7-18.el6.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 pam-1.1.1-24.el6.x86_64 zlib-1.2.3-29.el6.x86_64

(gdb) p relation
$1 = (struct RelationData *) 0x324647603246466
(gdb) p relation->rd_lockInfo
Cannot access memory at address 0x3246476032464aa

(gdb) bt
#0 index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
#1 0x00000000004bc3c2 in systable_endscan (sysscan=0x145b3c8) at genam.c:504
#2 0x000000000080f266 in SearchCatCache (cache=0x1427168, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1190
#3 0x000000000081cd7b in get_rel_name (relid=<value optimized out>) at lsyscache.c:1730
#4 0x00000000006a5299 in perform_work_item (workitem=0x7fd7c7f94860) at autovacuum.c:2609
#5 0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
#6 0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1694
#7 0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499
#8 0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5414
#9 sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5111
#10 <signal handler called>
#11 0x0000003bd74e1603 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
#12 0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1717
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
#14 0x0000000000638450 in main (argc=3, argv=0x13cd3a0) at main.c:228

#0 index_close (relation=0x324647603246466, lockmode=1) at indexam.c:178
relid = {relId = 21345368, dbId = 0}
#1 0x00000000004bc3c2 in systable_endscan (sysscan=0x145b3c8) at genam.c:504
No locals.
#2 0x000000000080f266 in SearchCatCache (cache=0x1427168, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1190
cur_skey = {{sk_flags = 0, sk_attno = 1, sk_strategy = 3, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0x7a8d50 <oideq>, fn_oid = 184, fn_nargs = 2, fn_strict = 1 '\001',
fn_retset = 0 '\000', fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x13cf3d8, fn_expr = 0x0}, sk_argument = 618293329}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0,
sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0},
{sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0, fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000',
fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}, {sk_flags = 0, sk_attno = 0, sk_strategy = 0, sk_subtype = 0, sk_collation = 0, sk_func = {fn_addr = 0, fn_oid = 0, fn_nargs = 0,
fn_strict = 0 '\000', fn_retset = 0 '\000', fn_stats = 0 '\000', fn_extra = 0x0, fn_mcxt = 0x0, fn_expr = 0x0}, sk_argument = 0}}
hashValue = 884019
hashIndex = 51
iter = {cur = <value optimized out>, end = 0x14276a8}
bucket = 0x14276a8
ct = <value optimized out>
relation = 0x7fd6b8788e58
scandesc = 0x145b3c8
ntp = <value optimized out>
#3 0x000000000081cd7b in get_rel_name (relid=<value optimized out>) at lsyscache.c:1730
tp = <value optimized out>
#4 0x00000000006a5299 in perform_work_item (workitem=0x7fd7c7f94860) at autovacuum.c:2609
cur_datname = 0x0
cur_nspname = 0x0
cur_relname = 0x0
__func__ = "perform_work_item"
#5 0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
workitem = 0x7fd7c7f94860
classRel = 0x7fd6b8788e58
tuple = <value optimized out>
relScan = <value optimized out>
dbForm = <value optimized out>
table_oids = <value optimized out>
orphan_oids = 0x0
ctl = {num_partitions = 0, ssize = 0, dsize = 0, max_dsize = 0, ffactor = 0, keysize = 4, entrysize = 80, hash = 0, match = 0, keycopy = 0, alloc = 0, hcxt = 0x0, hctl = 0x0}
table_toast_map = 0x14d5228
cell = 0x0
shared = 0x1499eb8
dbentry = 0x149a140
bstrategy = 0x156ef38
key = {sk_flags = 0, sk_attno = 16, sk_strategy = 3, sk_subtype = 0, sk_collation = 100, sk_func = {fn_addr = 0x750430 <chareq>, fn_oid = 61, fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000',
fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x1400838, fn_expr = 0x0}, sk_argument = 116}
pg_class_desc = 0x14c3748
effective_multixact_freeze_max_age = 400000000
did_vacuum = <value optimized out>
found_concurrent_worker = <value optimized out>
i = <value optimized out>
__func__ = "do_autovacuum"
#6 0x00000000006a7dc8 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1694
dbname = "gtt\000\000\000\000\000\325I\001\327;\000\000\000\067{\000\000\000\000\000\000E\352\200\327;\000\000\000\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\374\024\316", '\000' <repeats 12 times>
local_sigjmp_buf = {{__jmpbuf = {0, -6625136415178345871, 20928464, 1508210280, 31539, 1508209622, 6625596434918027889, -6625136601089035663}, __mask_was_saved = 1, __saved_mask = {__val = {
18446744066192964103, 13184024, 1, 13182664, 0, 31539, 1508209622, 0, 257010229664, 1, 0, 140729043734800, 257014425560, 4, 0, 4}}}}
dbid = 16400
__func__ = "AutoVacWorkerMain"
#7 0x00000000006a7e96 in StartAutoVacWorker () at autovacuum.c:1499
worker_pid = 0
__func__ = "StartAutoVacWorker"
#8 0x00000000006b595a in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5414
bn = 0x13f57d0
#9 sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5111
save_errno = 4
__func__ = "sigusr1_handler"
#10 <signal handler called>
No symbol table info available.
#11 0x0000003bd74e1603 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:82
---Type <return> to continue, or q <return> to quit---
No locals.
#12 0x00000000006b6c96 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1717
timeout = {tv_sec = 59, tv_usec = 981940}
rmask = {fds_bits = {120, 0 <repeats 15 times>}}
selres = <value optimized out>
now = <value optimized out>
readmask = {fds_bits = {120, 0 <repeats 15 times>}}
nSockets = 7
last_lockfile_recheck_time = 1508210280
last_touch_time = 1508209622
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
opt = <value optimized out>
status = <value optimized out>
userDoption = <value optimized out>
listen_addr_saved = <value optimized out>
i = <value optimized out>
output_config_variable = <value optimized out>
__func__ = "PostmasterMain"
#14 0x0000000000638450 in main (argc=3, argv=0x13cd3a0) at main.c:228

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2017-10-17 07:03:19 Re: [PATCH] pageinspect function to decode infomasks
Previous Message Rafia Sabih 2017-10-17 05:01:32 Re: [POC] Faster processing at Gather node