8.3.0 backend segfaults

From: "Alex Hunsaker" <badalex(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: 8.3.0 backend segfaults
Date: 2008-03-12 04:34:56
Message-ID: 34d269d40803112134l6e6b88eet3638312ffd628766@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

(Sorry if duplicates show up this is the third time ive posted this in
the past 10 hours, Im assuming it got dropped because of the
attachments)

Problem: Apparently random segfaults apparently query agnostic, seem
to be more frequent when a pg_dump is running

The most frequent query it segfaults with is:
select w.worker_id, w.worker_id as printerid, w.worker, w.alias,
coalesce(w.alias, w.worker) as name, w.active, w.last_active,
w.last_deactive, round(extract(epoch from now()) - extract(epoch from
w.last_deactive)) as time_off from workers as w left join worker_vis
as wv on wv.worker_id = w.worker_id and wv.defunct = 0 and (
((wv.auth_id = ?) and (wv.auth_class = data_class('user_id'))) or
((wv.auth_id = ?) and (wv.auth_class = data_class('clinic_id'))))
where wv.worker_vis_id is not null and w.defunct = 0 order by
coalesce(w.alias, w.worker);

(sample arguments, they do not seem to make a difference)
167214, 13
167340, 16

But have seen other simpler queries such as (sorry no backtraces for
these... yet)
insert into audit_transaction (action, open_user_id, page_load_id,
user_id) values (?, ?, ?, ?); ARGS=suds, 509057, 15148365, 167217

If I select * from workers; and select * from workers_active;
everything looks good, no segfaults. I can also take a pg_dump if
virtually nothing is using the database. Otherwise pg_dump dies
randomly with an errors like:

Dumping the contents of table "clients_audit" failed: PQgetCopyData() failed.
Dumping the contents of table "file_data" failed: PQgetCopyData() failed.
Dumping the contents of table "workers_audit" failed: PQgetCopyData() failed.

(note these are probably 3 of the larger tables in the database, maybe
it has something to do with statement_timeout, i do have it set quite
low (3 min) ?)

If I manually prepare and execute the above queries they work fine.
prepare worker (bigint, bigint) as select w.worker_id, w.worker_id as
printerid, w.worker, w.alias, coalesce(w.alias, w.worker) as name,
w.active, w.last_active, w.last_deactive, round(extract(epoch from
now()) - extract(epoch from w.last_deactive)) as time_off from workers
as w left join worker_vis as wv on wv.worker_id = w.worker_id and
wv.defunct = 0 and ( ((wv.auth_id = ?) and (wv.auth_class =
data_class('user_id'))) or ((wv.auth_id = ?) and (wv.auth_class =
data_class('clinic_id')))) where wv.worker_vis_id is not null and
w.defunct = 0 order by coalesce(w.alias, w.worker);
execute worker (167214, 13);

Core dumps and binaries available if needed. I have about 6 core dumps
across 2 different servers. Which leads me to believe its either
postgres bug or a corrupt database (the other server is a pitr slave
for the master which i switched over to see if i could reproduce the
segfault on the master)

Nothing in dmesg, nothing in mcelog.
Raid controller is a 3ware 9550SX-12 with 6 70GB WD Raptor hard drives
in a raid 10, smart tests pass fine... everything looks good.

Work load is a web application where each page beings a transaction;
creates a temp table, does a few selects, inserts and updates and the
commits.

Postgresql 8.2.5 was working without any problems just prior to the upgrade.

\d workers
Table "public.workers"
Column | Type |
Modifiers
---------------+--------------------------+-----------------------------------------------------------------
worker_id | bigint | not null default
nextval('workers_worker_id_seq'::regclass)
date_created | timestamp with time zone | not null default now()
clientid | bigint | not null
worker | text | not null
alias | text |
job_type | smallint | not null
active | smallint | not null default 0
last_active | timestamp with time zone |
last_deactive | timestamp with time zone |
defunct | smallint | not null default 0
audit_class | integer |
audit_date | timestamp with time zone | not null default now()
audit_desc | text |
audit_id | bigint |
audit_seq | integer | not null default
nextval('audit_basic_audit_seq_seq'::regclass)
audit_table | character varying(64) |
audit_tid | bigint |
audit_type | character varying(32) |
audit_orig_id | integer |
path | text |
Indexes:
"workers_pkey" PRIMARY KEY, btree (worker_id) CLUSTER
Triggers:
workers_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON workers FOR
EACH ROW EXECUTE PROCEDURE audit_table_go()

\d worker_vis
Table "public.worker_vis"
Column | Type |
Modifiers
---------------+--------------------------+-----------------------------------------------------------------
worker_vis_id | bigint | not null default
nextval('worker_vis_ruleid_seq'::regclass)
date_created | timestamp with time zone | not null default now()
worker_id | bigint |
auth_id | bigint |
auth_class | smallint |
defunct | smallint | not null default 0
audit_class | integer |
audit_date | timestamp with time zone | not null default now()
audit_desc | text |
audit_id | bigint |
audit_seq | integer | not null default
nextval('audit_basic_audit_seq_seq'::regclass)
audit_table | character varying(64) |
audit_tid | bigint |
audit_type | character varying(32) |
audit_orig_id | integer |
Indexes:
"worker_vis_pkey" PRIMARY KEY, btree (worker_vis_id)
"worker_vis_vis_idx" btree (worker_id, defunct, auth_id, auth_class) CLUSTER
Triggers:
worker_vis_audit_aud BEFORE INSERT OR DELETE OR UPDATE ON
worker_vis FOR EACH ROW EXECUTE PROCEDURE audit_table_go()

Version Information:

Following REL8.3_STABLE 2 versions tried both crash with a similar
backtrace (both died in CopySnapshot):

(currently running)
Author: teodor <teodor>
Date: Sun Mar 9 10:42:48 2008 +0000

Revert changes of CompareTSQ: it affects existing btree indexes.

(was running but also crashed)
Author: tgl <tgl>
Date: Sun Mar 2 00:10:28 2008 +0000

Venezuela Time now means UTC-4:30, not UTC-4:00. Adjust our treatment
of "VET" accordingly. Per bug #3997 from Aaron Mizrachi.

Memory: 8G ECC
Swap: 16GB (unused even at segfault time aka it does not seem to be
running out of memmory)
OS: Linux 2.6.24.3

select version();
PostgreSQL 8.3.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc
(GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

built with: --disable-nls --with-openssl --without-python
--without-tcl --without-krb5 --with-perl --enable-debug
--enable-integer-datetimes --with-ossp-uuid

postgresql.conf:
listen_addresses = '*'
port = 5432
max_connections = 500
superuser_reserved_connections = 3
ssl = on
shared_buffers = 1000MB
work_mem = 32MB
maintenance_work_mem = 256MB
max_stack_depth = 7MB
max_fsm_pages = 304800
max_fsm_relations = 3000
vacuum_cost_delay = 30
bgwriter_lru_maxpages = 500
wal_sync_method = fdatasync
wal_buffers = 256kB
checkpoint_segments = 8
checkpoint_timeout = 10min
archive_command = 'rsync -az -e "ssh" %p
psql-sync(at)10(dot)10(dot)10(dot)1:standby/cur/wal/%f </dev/null'
random_page_cost = 2.6
effective_cache_size = 6GB
default_statistics_target = 101
constraint_exclusion = on
log_destination = 'syslog'
client_min_messages = warning
log_min_messages = warning
silent_mode = on
log_line_prefix = '[%m] %h:%d '
log_lock_waits = on
log_temp_files = 20MB
autovacuum_naptime = 3min
autovacuum_vacuum_threshold = 200
autovacuum_analyze_threshold = 200
statement_timeout = '3min'
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
backslash_quote = off
escape_string_warning = on
sql_inheritance = off
standard_conforming_strings = on

cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 2000.000
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
3dnowext 3dnow rep_good
bogomips : 3991.79
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 2000.000
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
3dnowext 3dnow rep_good
bogomips : 3991.79
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

Core was generated by `postgres: docsysweb CaduRx 192.168.134.1(49502) BIND '.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6
(gdb) bt
#0 0x00002b6ef7457f74 in memcpy () from /lib/libc.so.6
#1 0x000000000065fc88 in CopySnapshot (snapshot=0xa4e6e0) at tqual.c:1311
#2 0x000000000051cf34 in fmgr_sql (fcinfo=0x7fffb3f51160) at functions.c:299
#3 0x00000000005164cd in ExecMakeFunctionResult (fcache=0x10b4710,
econtext=0x10b4ba0, isNull=0x7fffb3f515af "", isDone=0x0) at
execQual.c:1351
#4 0x00000000005191ca in ExecEvalExprSwitchContext
(expression=0x2b6f2c5dc078, econtext=0x10, isNull=0x24c8500 <Address
0x24c8500 out of bounds>, isDone=0x24c84c0)
at execQual.c:3726
#5 0x00000000005690bd in evaluate_expr (expr=<value optimized out>,
result_type=23, result_typmod=-1) at clauses.c:3273
#6 0x0000000000569d21 in simplify_function (funcid=2214548,
result_type=23, result_typmod=-1, args=0xf8db40, allow_inline=1
'\001', context=0x7fffb3f51700) at clauses.c:2887
#7 0x000000000056a53a in eval_const_expressions_mutator
(node=0xf24d90, context=0x7fffb3f51700) at clauses.c:1795
#8 0x000000000056b185 in estimate_expression_value (root=<value
optimized out>, node=0x2b6f2c5dc078) at clauses.c:1719
#9 0x00000000005f8b86 in get_restriction_variable (root=0xf22d80,
args=<value optimized out>, varRelid=0, vardata=0x7fffb3f51810,
other=0x7fffb3f51848,
varonleft=0x7fffb3f5185f "\001\236") at selfuncs.c:3570
#10 0x00000000005fa562 in eqsel (fcinfo=<value optimized out>) at selfuncs.c:169
#11 0x000000000064540c in OidFunctionCall4 (functionId=<value
optimized out>, arg1=15871360, arg2=532, arg3=15879312, arg4=0) at
fmgr.c:1615
#12 0x000000000056d0c2 in restriction_selectivity (root=0xf22d80,
operator=532, args=0xf24c90, varRelid=0) at plancat.c:805
#13 0x000000000054e797 in clause_selectivity (root=0xf22d80,
clause=0xf24de0, varRelid=0, jointype=JOIN_INNER) at clausesel.c:639
#14 0x000000000054e0df in clauselist_selectivity (root=0xf22d80,
clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:123
#15 0x000000000054e448 in clause_selectivity (root=0xf22d80,
clause=0xf8bb20, varRelid=0, jointype=JOIN_INNER) at clausesel.c:576
#16 0x000000000054e72a in clause_selectivity (root=0xf22d80,
clause=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:597
#17 0x000000000054e0df in clauselist_selectivity (root=0xf22d80,
clauses=<value optimized out>, varRelid=0, jointype=JOIN_INNER) at
clausesel.c:123
#18 0x000000000054f99f in set_baserel_size_estimates
(root=0x2b6f2c5dc078, rel=0xf26748) at costsize.c:2262
#19 0x000000000054d67c in set_rel_pathlist (root=0xf22d80,
rel=0xf26748, rti=2, rte=0xec7c30) at allpaths.c:215
#20 0x000000000054df02 in make_one_rel (root=0xf22d80,
joinlist=0xf26668) at allpaths.c:150
#21 0x000000000055f710 in query_planner (root=0xf22d80, tlist=<value
optimized out>, tuple_fraction=0, limit_tuples=-1,
cheapest_path=0x7fffb3f52220,
sorted_path=0x7fffb3f52218, num_groups=0x7fffb3f52228) at planmain.c:249
#22 0x0000000000560041 in grouping_planner (root=0xf22d80,
tuple_fraction=<value optimized out>) at planner.c:897
#23 0x00000000005610b8 in subquery_planner (glob=0xee11a0,
parse=0xee1230, level=0, tuple_fraction=0, subroot=0x7fffb3f523e8) at
planner.c:431
#24 0x00000000005614b1 in standard_planner (parse=0xee1230,
cursorOptions=0, boundParams=0x0) at planner.c:158
#25 0x00000000005a1c61 in pg_plan_query (querytree=0xee1230,
cursorOptions=0, boundParams=0x0) at postgres.c:681
#26 0x00000000005a1d13 in pg_plan_queries (querytrees=<value optimized
out>, cursorOptions=0, boundParams=0x0, needSnapshot=0 '\0') at
postgres.c:752
#27 0x0000000000633f93 in do_planning (querytrees=0xf22d50,
cursorOptions=0) at plancache.c:560
#28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0,
useResOwner=0 '\0') at plancache.c:484
#29 0x00000000005a3b05 in PostgresMain (argc=4, argv=<value optimized
out>, username=0x8f3020 "docsysweb") at postgres.c:1605
#30 0x0000000000579d8f in ServerLoop () at postmaster.c:3207
#31 0x000000000057a85c in PostmasterMain (argc=3, argv=0x8ee2d0) at
postmaster.c:1029
#32 0x00000000005345ce in main (argc=3, argv=<value optimized out>) at
main.c:188
gdb) frame 28
#28 0x000000000063436b in RevalidateCachedPlan (plansource=0xcd02f0,
useResOwner=0 '\0') at plancache.c:484
484 slist = do_planning(slist,
plansource->cursor_options);
(gdb) print plansource->query_string
$1 = 0xf08ea0 "select w.worker_id, w.worker_id as printerid, w.worker,
w.alias, coalesce(w.alias, w.worker) as name, w.active, w.last_active,
w.last_deactive, round(extract(epoch from now()) - extract(epoch from
w.l"...

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2008-03-12 04:59:12 Re: 8.3.0 backend segfaults
Previous Message Tom Lane 2008-03-12 01:01:32 Re: BUG #4027: backslash escaping not disabled in plpgsql