From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November
Date: 2015-03-23 16:35:36
Message-ID: 551040D8.4040104@2ndquadrant.com
Lists: pgsql-hackers

On 23.3.2015 16:02, Robert Haas wrote:
> On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> This time I've however checked the log, and what caught my eye is that
>> the last log message is from November. There were regular messages until
>> then (a few messages per day), but since Nov 19 there's nothing.
>
> I'd try strace on the backends, to see if they are issuing any system calls.

Good idea. Both backends are doing a bunch of open/lseek/read/close calls -
there seems to be a repeating pattern:

open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("global/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\33\0\0\0\356\4\0\0\243m\0\0\224\v\0\0T2\0\0\275\4\0\0Z2\0\0"..., 512)
= 512
close(10) = 0
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("base/16384/28056", O_RDWR) = 10
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192

It seems to vary a bit over time (or maybe the pattern is just longer than
the sample). I guess it's a consequence of the recursive cache resets.
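
For reference, the CLOBBER_CACHE_RECURSIVELY hook in inval.c looks roughly
like this (paraphrased from memory, so don't take it as verbatim) - every
AcceptInvalidationMessages() call ends in an unconditional cache reset, and
rebuilding the caches means re-reading pg_filenode.map and re-opening the
catalog/relation files, which would explain the repeating
open/lseek/read/close pattern above:

    void
    AcceptInvalidationMessages(void)
    {
        ReceiveSharedInvalidMessages(LocalExecuteInvalidationMessage,
                                     InvalidateSystemCaches);

    #if defined(CLOBBER_CACHE_ALWAYS)
        {
            /* non-recursive variant: guard against re-entering the reset */
            static bool in_recursion = false;

            if (!in_recursion)
            {
                in_recursion = true;
                InvalidateSystemCaches();
                in_recursion = false;
            }
        }
    #elif defined(CLOBBER_CACHE_RECURSIVELY)
        /* recursive variant: no guard, so every call resets everything */
        InvalidateSystemCaches();
    #endif
    }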

Attached are a few seconds' worth of strace output from each backend.

> I'd also try attaching gdb and pulling a back trace, then continue,
> then hit ^C, and pull another backtrace, repeat about 10 times, and
> see if you can figure out where it's stuck.

Hmmm, this seems a bit more interesting - see the backtraces attached.

The first thing that caught my eye is that the backtraces are quite long,
and the top part is a sequence of alternating heap_open() and
AcceptInvalidationMessages() frames. Moreover, they all seem to involve the
same relation, judging by the relid:

==================================================================
#7 AcceptInvalidationMessages () at inval.c:691
#8 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
#10 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
#16 AcceptInvalidationMessages () at inval.c:691
#17 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#26 AcceptInvalidationMessages () at inval.c:691
#27 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#36 AcceptInvalidationMessages () at inval.c:691
#37 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#47 AcceptInvalidationMessages () at inval.c:691
#48 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#50 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#56 AcceptInvalidationMessages () at inval.c:691
#57 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#59 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#65 AcceptInvalidationMessages () at inval.c:691
#66 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#75 AcceptInvalidationMessages () at inval.c:691
#76 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#78 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#84 AcceptInvalidationMessages () at inval.c:691
#85 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
==================================================================
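
I guess the alternation is simply how LockRelationOid() behaves - after
acquiring the lock it calls AcceptInvalidationMessages() (that's the
lmgr.c:107 frame), which with CLOBBER_CACHE_RECURSIVELY resets all the
caches, and rebuilding a relcache entry opens the catalogs via heap_open(),
which calls LockRelationOid() again. Roughly (again paraphrased, not
verbatim):

    void
    LockRelationOid(Oid relid, LOCKMODE lockmode)
    {
        LOCKTAG     tag;
        LockAcquireResult res;

        SetLocktagRelationOid(&tag, relid);

        res = LockAcquire(&tag, lockmode, false, false);

        /*
         * Check for invalidation messages now that we hold the lock - with
         * CLOBBER_CACHE_RECURSIVELY this resets the caches every time, and
         * the subsequent relcache rebuild goes through heap_open() ->
         * LockRelationOid() -> AcceptInvalidationMessages() again, hence
         * the alternating frames above.
         */
        if (res != LOCKACQUIRE_ALREADY_HELD)
            AcceptInvalidationMessages();
    }

If that reading is right, the nesting of reset-inside-rebuild would also
explain why the stacks are so deep.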

The other thing is that a fair number of pointers are resolved like this:

#157 0x00000000007d00ed in PostgresMain (argc=31196760, argv=0x1,
dbname=0x1c86a40 "regression", username=0x3 <Address 0x3 out of bounds>)
at postgres.c:4010

while in the next backtrace from the same backend it looks like this:

#117 0x00000000007d00ed in PostgresMain (argc=12, argv=0xf32ae0,
dbname=0x1c86a40 "regression", username=0xffffffffffffffff <Address
0xffffffffffffffff out of bounds>) at postgres.c:4010

That seems a bit strange - I would not expect this to change once the
backend is started, but maybe I'm missing something.

And finally, a fair number of pointers are resolved as 0x7f71e74fd770,
which seems odd too.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
7953.log.gz application/gzip 46.0 KB
7956.log.gz application/gzip 50.5 KB
backtraces-7953.log text/x-log 95.2 KB
backtraces-7956.log text/x-log 70.5 KB
