Re: hung backends stuck in spinlock heavy endless loop

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date: 2015-01-16 14:22:27
Message-ID: 20150116142227.GF16991@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> > On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> >> Running this test on another set of hardware to verify -- if this
> >> turns out to be a false alarm which it may very well be, I can only
> >> offer my apologies! I've never had a new drive fail like that, in
> >> that manner. I'll burn the other hardware in overnight and report
> >> back.
>
> huh -- well possibly. not. This is on a virtual machine attached to a
> SAN. It ran clean for several (this is 9.4 vanilla, asserts off,
> checksums on) hours then the starting having issues:

Damn.

Is there any chance you can package this somehow so that others can run
it locally? It looks hard to find the actual bug here without adding
instrumentation to postgres.

> [cds2 21952 2015-01-15 22:54:51.833 CST 5502]WARNING: page
> verification failed, calculated checksum 59143 but expected 59137 at
> character 20
> [cds2 21952 2015-01-15 22:54:51.852 CST 5502]QUERY:
> DELETE FROM "onesitepmc"."propertyguestcard" t
> WHERE EXISTS
> (
> SELECT 1 FROM "propertyguestcard_insert" d
> WHERE (t."prptyid", t."gcardid") = (d."prptyid", d."gcardid")
> )

> [cds2 21952 2015-01-15 22:54:51.852 CST 5502]CONTEXT: PL/pgSQL
> function cdsreconciletable(text,text,text,text,boolean) line 197 at
> EXECUTE statement
> SQL statement "SELECT * FROM CDSReconcileTable(
> t.CDSServer,
> t.CDSDatabase,
> t.SchemaName,
> t.TableName)"
> PL/pgSQL function cdsreconcileruntable(bigint) line 35 at SQL statement

This was the first error? None of the 'did not find subXID' warnings
beforehand?

> [cds2 32353 2015-01-16 04:40:57.814 CST 7549]WARNING: did not find
> subXID 7553 in MyProc
> [cds2 32353 2015-01-16 04:40:57.814 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.018 CST 7549]WARNING: you don't own a
> lock of type AccessShareLock
> [cds2 32353 2015-01-16 04:40:58.018 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]LOG: could not send data
> to client: Broken pipe
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]STATEMENT: SELECT
> CDSReconcileRunTable(1160)
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:
> ReleaseLockIfHeld: failed??
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: you don't own a
> lock of type AccessShareLock
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING:
> ReleaseLockIfHeld: failed??
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]WARNING: you don't own a
> lock of type AccessShareLock
> [cds2 32353 2015-01-16 04:40:58.026 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
> ReleaseLockIfHeld: failed??
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: you don't own a
> lock of type AccessShareLock
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
> ReleaseLockIfHeld: failed??
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING: you don't own a
> lock of type ShareLock
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
> ReleaseLockIfHeld: failed??
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]ERROR: failed to re-find
> shared lock object
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]CONTEXT: PL/pgSQL
> function cdsreconcileruntable(bigint) line 35 during exception cleanup
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]STATEMENT: SELECT
> CDSReconcileRunTable(1160)
> [cds2 32353 2015-01-16 04:40:58.027 CST 7549]WARNING:
> AbortSubTransaction while in ABORT state

This indicates a bug in our subtransaction abort handling. It looks to
me like there actually might be several. But it's probably a consequence
of an earlier bug. It's hard to diagnose the actual issue, because we're
not seeing the original error(s) :(.

Could you add an EmitErrorReport(); call before the FlushErrorState() in
pl_exec.c's exec_stmt_block()?
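
To illustrate why the ordering matters: once the error state is flushed,
the pending error is gone, so it has to be reported first if it is ever to
reach the log. The toy C model below is only a sketch of that ordering
under my own assumptions. Every toy_* name is a hypothetical stand-in; none
of this is PostgreSQL's actual code or API.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model only: these are NOT PostgreSQL's real data structures. */
typedef struct
{
    char message[128];   /* text of the pending error */
    int  pending;        /* 1 while an error is waiting to be handled */
} ToyErrorState;

static ToyErrorState toy_error;
static int toy_reports_emitted = 0;   /* how many errors reached the "log" */

/* Record an error as pending (stand-in for an error being thrown). */
static void
toy_raise(const char *msg)
{
    assert(msg != NULL);
    snprintf(toy_error.message, sizeof(toy_error.message), "%s", msg);
    toy_error.pending = 1;
}

/* Stand-in for emitting the report: write the pending error to the log. */
static void
toy_emit_error_report(void)
{
    if (toy_error.pending)
    {
        fprintf(stderr, "ERROR: %s\n", toy_error.message);
        toy_reports_emitted++;
    }
}

/* Stand-in for flushing: discard the pending error without logging it. */
static void
toy_flush_error_state(void)
{
    toy_error.pending = 0;
    toy_error.message[0] = '\0';
}

/*
 * Stand-in for the exception handler path. With emit_first == 0 the
 * original error is silently dropped; with emit_first == 1 it is logged
 * before being discarded.
 */
static void
toy_handle_exception(int emit_first)
{
    if (emit_first)
        toy_emit_error_report();
    toy_flush_error_state();
}
```

Calling toy_raise() followed by toy_handle_exception(0) leaves nothing in
the log, while toy_handle_exception(1) writes the original message before
the state is cleared. That second behaviour is what the extra call is
meant to give us here.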

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
