Re: Troubleshooting a segfault and instance crash

From: Blair Boadway <bboadway(at)abebooks(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: Troubleshooting a segfault and instance crash
Date: 2018-03-27 22:47:54
Message-ID: 616D192F-6ACA-49B7-A5F5-0D853E6FD001@abebooks.com
Lists: pgsql-general

As a follow-up, we’ve been able to get the same backtrace, implicating pg_hint_plan, from two separate crashes. We were using pg_hint_plan 1.2.2 and have reported the issue on the pg_hint_plan GitHub. Since we removed pg_hint_plan, the system no longer appears to segfault under the same conditions. This strongly suggests that pg_hint_plan was the root cause of our issue, but we can’t yet be 100% certain because the issue was always transient.

-Blair

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Date: Saturday, March 24, 2018 at 9:18 PM
To: Blair Boadway <bboadway(at)abebooks(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Troubleshooting a segfault and instance crash

2018-03-25 0:41 GMT+01:00 Blair Boadway <bboadway(at)abebooks(dot)com>:
Thanks for the tip. We are using RHEL 6.9 and definitely up to date on glibc (2.12-1.209.el6_9.2). We also have the same versions on a very similar system with no segfault.

My colleague got a better backtrace, which shows another extension:

Core was generated by `postgres: batch_user_account''.
Program terminated with signal 11, Segmentation fault.
#0 0x000000386712868a in __strcmp_sse42 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install postgresql96-server-9.6.5-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x000000386712868a in __strcmp_sse42 () from /lib64/libc.so.6
#1 0x00007fa3f0c7074c in get_query_string (pstate=<value optimized out>, query=<value optimized out>, jumblequery=<value optimized out>) at pg_hint_plan.c:1882
#2 0x00007fa3f0c70a5d in pg_hint_plan_post_parse_analyze (pstate=0x25324b8, query=0x25325e8) at pg_hint_plan.c:2875
#3 0x00000000005203bc in parse_analyze ()
#4 0x00000000006df933 in pg_analyze_and_rewrite ()
#5 0x00000000007c6f6b in ?? ()
#6 0x00000000007c6ff0 in CachedPlanGetTargetList ()
#7 0x00000000006e173a in PostgresMain ()
#8 0x00000000006812f5 in PostmasterMain ()
#9 0x0000000000609278 in main ().
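
For readers less used to gdb output, the following is a minimal sketch of how the top of this backtrace can be read mechanically: frame #0 is where the crash happened, and the next frame with a source location is its caller. The regular expression and the `parse_frames` helper are illustrative assumptions, not part of gdb or PostgreSQL, and only the first four frames from the trace above are included.

```python
import re

# First four frames copied from the backtrace above.
BACKTRACE = """\
#0 0x000000386712868a in __strcmp_sse42 () from /lib64/libc.so.6
#1 0x00007fa3f0c7074c in get_query_string (pstate=<value optimized out>, query=<value optimized out>, jumblequery=<value optimized out>) at pg_hint_plan.c:1882
#2 0x00007fa3f0c70a5d in pg_hint_plan_post_parse_analyze (pstate=0x25324b8, query=0x25325e8) at pg_hint_plan.c:2875
#3 0x00000000005203bc in parse_analyze ()
"""

# Hypothetical pattern for "#N 0xADDR in func (args) [at file:line] [from lib]".
FRAME = re.compile(
    r"^#(?P<num>\d+)\s+0x[0-9a-f]+ in (?P<func>\S+) \(.*?\)"
    r"(?: at (?P<file>[^:\s]+):(?P<line>\d+))?"
    r"(?: from (?P<lib>\S+))?"
)

def parse_frames(text):
    """Return one dict per recognised frame line, in stack order."""
    frames = []
    for raw in text.splitlines():
        match = FRAME.match(raw)
        if match:
            frames.append(match.groupdict())
    return frames

frames = parse_frames(BACKTRACE)
crash = frames[0]
# The first frame that carries a source file is the nearest caller with debug info.
caller = next(f for f in frames if f["file"] is not None)

print("crashed in", crash["func"], "from", crash["lib"])
print("called from", caller["func"], "at", caller["file"] + ":" + caller["line"])
```

Frames without a source location (such as `parse_analyze ()` here) come from binaries whose debuginfo is not installed, which is what the "Missing separate debuginfos" message refers to.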

We aren’t sure whether this indicates that pg_hint_plan is causing the segfault or whether it just happened to be running when the segfault occurred. We aren’t actually using pg_hint_plan hints in this system, so we’re not sure how all this relates to the segfault, which occurs when another process runs ‘grant usage on schema abc to user xyz;’ and that process is unrelated to the account that segfaults.

Although you don't use pg_hint_plan explicitly, pg_hint_plan is still active: it is invoked via planner callbacks.
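
To illustrate the point about callbacks, here is a toy model in Python. It is not PostgreSQL's code and not pg_hint_plan's implementation. It only assumes the general hook pattern visible in the backtrace, where `parse_analyze` calls `pg_hint_plan_post_parse_analyze`. The names `load_extension`, `hint_extension_hook` and the `calls` list are hypothetical.

```python
# Toy model of a hook slot: core code calls whatever callback is installed.
post_parse_analyze_hook = None  # empty until an extension installs a callback

calls = []  # records every query text the extension callback sees


def parse_analyze(query_text):
    """Stand-in for the core parse/analyze step."""
    query = {"text": query_text}
    if post_parse_analyze_hook is not None:
        # The callback runs for every query, hinted or not.
        post_parse_analyze_hook(query)
    return query


def hint_extension_hook(query):
    """Hypothetical extension callback: looks at each query for a hint comment."""
    calls.append(query["text"])
    query["has_hint"] = query["text"].lstrip().startswith("/*+")


def load_extension():
    """Installing the callback is all that 'loading' means in this model."""
    global post_parse_analyze_hook
    post_parse_analyze_hook = hint_extension_hook


load_extension()
plain = parse_analyze("SELECT 1")
hinted = parse_analyze("/*+ SeqScan(t) */ SELECT * FROM t")
print(len(calls), plain["has_hint"], hinted["has_hint"])
```

In this model the callback runs for the plain query as well as the hinted one, so a bug in the callback can surface in sessions that never use a hint.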

Short of better ideas, we will pull the pg_hint_plan extension and see if that removes the problem.

Please report this backtrace to the pg_hint_plan authors.
Regards
Pavel

-Blair

From: Peter Geoghegan <pg(at)bowt(dot)ie>
Date: Saturday, March 24, 2018 at 4:18 PM
To: Blair Boadway <bboadway(at)abebooks(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Troubleshooting a segfault and instance crash

On Thu, Mar 8, 2018 at 9:40 AM, Blair Boadway <bboadway(at)abebooks(dot)com> wrote:
Mar 7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 ip
000000302f32868a sp 00007ffcf1547498 error 4 in
libc-2.12.so[302f200000+18a000]

Mar 7 14:46:35 pgprod2 POSTGRES[21262]: [5] user=,db=,app=client= LOG:
server process (PID 29351) was terminated by signal 11: Segmentation fault

It crashes the database, though it starts again on its own without any
apparent issues. This has happened 3 times in 2 months and each time the
segfault error and memory address is the same.
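
As an aside on the kernel line quoted above, the sketch below decodes its fields, assuming the usual x86 Linux segfault message format: `at` is the faulting address, `ip` the instruction pointer, `error` the page-fault error code in hex, and the bracket gives the mapping's base address and size. The helper name `decode_segfault` is hypothetical, and the log line has been rejoined onto one line.

```python
import re

# Kernel log line from the report above, rejoined onto a single line.
LINE = ("postgres[29351]: segfault at 0 ip 000000302f32868a "
        "sp 00007ffcf1547498 error 4 in libc-2.12.so[302f200000+18a000]")

PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault line into its fields (all numbers are hex)."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a recognised segfault line")
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    err = int(match.group("err"), 16)
    return {
        "object": match.group("obj"),
        "fault_address": int(match.group("addr"), 16),
        # Offset of the faulting instruction inside the mapped object.
        "offset_in_object": ip - base,
        # x86 page-fault error code bits: 1 = protection fault (else page
        # not present), 2 = write (else read), 4 = user mode (else kernel).
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


info = decode_segfault(LINE)
print(info["object"], hex(info["offset_in_object"]),
      info["mode"], info["access"], "at", hex(info["fault_address"]))
```

Read this way, the line describes a user-mode read of address 0, that is, a NULL pointer dereference inside libc, at offset 0x12868a into the library. The gdb frame #0 address elsewhere in this thread (0x000000386712868a) ends in the same digits, which is consistent with the same libc instruction, though that depends on where libc was mapped in that process.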

We had a recent report of a segfault on a Redhat compatible system,
that seemed like it might originate from within its glibc [1].
Although all the versions there didn't match what you have, it's worth
considering as a possibility.

Maybe you can't install debuginfo packages because you don't yet have
the necessary debuginfo repos set up. Just a guess. That is sometimes
a required extra step.

[1] https://postgr.es/m/7369.1520528405@sss.pgh.pa.us
--
Peter Geoghegan
