From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November
Date: 2015-03-23 16:35:36
Message-ID: 551040D8.4040104@2ndquadrant.com
Lists: pgsql-hackers

On 23.3.2015 16:02, Robert Haas wrote:
> On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> This time I've however checked the log, and what caught my eye is that
>> the last log message is from November. There were regular messages until
>> then (a few messages per day), but since Nov 19 there's nothing.
>
> I'd try strace on the backends, to see if they are issuing any system calls.

Good idea. Both backends are doing a bunch of open/lseek/read/close calls -
there seems to be a repeating pattern:

open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("global/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\33\0\0\0\356\4\0\0\243m\0\0\224\v\0\0T2\0\0\275\4\0\0Z2\0\0"..., 512)
= 512
close(10) = 0
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("base/16384/28056", O_RDWR) = 10
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192

It seems to vary a bit over time (or maybe the pattern is just longer than
the sample). I guess it's a consequence of the recursive cache resets.
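
For reference, the CLOBBER_CACHE_RECURSIVELY hook in inval.c looks roughly
like this (paraphrased from memory, so don't take it as verbatim) - every
AcceptInvalidationMessages() call ends in an unconditional cache reset, and
rebuilding the caches means re-reading pg_filenode.map and re-opening the
catalog/relation files, which would explain the repeating
open/lseek/read/close pattern above:

    void
    AcceptInvalidationMessages(void)
    {
        ReceiveSharedInvalidMessages(LocalExecuteInvalidationMessage,
                                     InvalidateSystemCaches);

    #if defined(CLOBBER_CACHE_ALWAYS)
        {
            /* non-recursive variant: guard against re-entering the reset */
            static bool in_recursion = false;

            if (!in_recursion)
            {
                in_recursion = true;
                InvalidateSystemCaches();
                in_recursion = false;
            }
        }
    #elif defined(CLOBBER_CACHE_RECURSIVELY)
        /* recursive variant: no guard, so every call resets everything */
        InvalidateSystemCaches();
    #endif
    }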

Attached are a few seconds' worth of strace output from each backend.

> I'd also try attaching gdb and pulling a back trace, then continue,
> then hit ^C, and pull another backtrace, repeat about 10 times, and
> see if you can figure out where it's stuck.

Hmmm, this seems a bit more interesting - see the backtraces attached.

The first thing that caught my eye is that the backtraces are quite long,
and the top part is a sequence of alternating heap_open() and
AcceptInvalidationMessages() frames. Moreover, they all seem to involve the
same relation, judging by the relid:

==================================================================
#7 AcceptInvalidationMessages () at inval.c:691
#8 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
#10 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
#16 AcceptInvalidationMessages () at inval.c:691
#17 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#26 AcceptInvalidationMessages () at inval.c:691
#27 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#36 AcceptInvalidationMessages () at inval.c:691
#37 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#47 AcceptInvalidationMessages () at inval.c:691
#48 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#50 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#56 AcceptInvalidationMessages () at inval.c:691
#57 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#59 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#65 AcceptInvalidationMessages () at inval.c:691
#66 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#75 AcceptInvalidationMessages () at inval.c:691
#76 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#78 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#84 AcceptInvalidationMessages () at inval.c:691
#85 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
==================================================================
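
I guess the alternation is simply how LockRelationOid() behaves - after
acquiring the lock it calls AcceptInvalidationMessages() (that's the
lmgr.c:107 frame), which with CLOBBER_CACHE_RECURSIVELY resets all the
caches, and rebuilding a relcache entry opens the catalogs via heap_open(),
which calls LockRelationOid() again. Roughly (again paraphrased, not
verbatim):

    void
    LockRelationOid(Oid relid, LOCKMODE lockmode)
    {
        LOCKTAG     tag;
        LockAcquireResult res;

        SetLocktagRelationOid(&tag, relid);

        res = LockAcquire(&tag, lockmode, false, false);

        /*
         * Check for invalidation messages now that we hold the lock - with
         * CLOBBER_CACHE_RECURSIVELY this resets the caches every time, and
         * the subsequent relcache rebuild goes through heap_open() ->
         * LockRelationOid() -> AcceptInvalidationMessages() again, hence
         * the alternating frames above.
         */
        if (res != LOCKACQUIRE_ALREADY_HELD)
            AcceptInvalidationMessages();
    }

If that reading is right, the nesting of reset-inside-rebuild would also
explain why the stacks are so deep.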

The other thing is that a fair number of pointers are resolved like this:

#157 0x00000000007d00ed in PostgresMain (argc=31196760, argv=0x1,
dbname=0x1c86a40 "regression", username=0x3 <Address 0x3 out of bounds>)
at postgres.c:4010

while in the next backtrace from the same backend it looks like this:

#117 0x00000000007d00ed in PostgresMain (argc=12, argv=0xf32ae0,
dbname=0x1c86a40 "regression", username=0xffffffffffffffff <Address
0xffffffffffffffff out of bounds>) at postgres.c:4010

That seems a bit strange - I would not expect this to change once the
backend is started, but maybe I'm missing something.

And finally, a fair number of pointers are resolved as 0x7f71e74fd770,
which seems odd too.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
7953.log.gz application/gzip 46.0 KB
7956.log.gz application/gzip 50.5 KB
backtraces-7953.log text/x-log 95.2 KB
backtraces-7956.log text/x-log 70.5 KB
