Re: BUG #7500: hot-standby replica crash after an initial rsync

From: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync
Date: 2012-08-22 04:18:14
Message-ID: CAK-MWwTfoG_HcUhFk2KVm1R8XPWEq1AEw3JWuodWNq76GUBm8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Aug 22, 2012 at 1:50 PM, Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com> wrote:

>
>
> On Wed, Aug 22, 2012 at 6:08 AM, Andres Freund <andres(at)2ndquadrant(dot)com>wrote:
>
>> On Tuesday, August 21, 2012 03:30:44 PM Maxim Boguk wrote:
>> > Hi Andres,
>>
>> I would add something akin to
>>
>> elog(WARNING, "pid of startup is: %d, sleeping for 10s", getpid());
>> sleep(10);
>>
>
> Hi Andres,
>
> There is it:
>
> gdb backtrace:
>
> #0 0x0000000800bd1ab0 in exit () from /lib/libc.so.7
> #1 0x00000000005d77f6 in proc_exit (code=1) at ipc.c:137
> #2 0x0000000000697948 in errfinish (dummy=<optimized out>) at elog.c:501
> #3 0x0000000000488956 in SlruReportIOError (ctl=<optimized out>,
> pageno=<optimized out>, xid=361788644) at slru.c:857
> #4 0x000000000048956a in SimpleLruReadPage (ctl=0x95bf80, pageno=176654,
> write_ok=1 '\001', xid=361788644) at slru.c:427
> #5 0x0000000000489f7f in RecordNewMultiXact (multi=361788644,
> offset=822580965, nxids=2, xids=0x800e5302c) at multixact.c:764
> #6 0x000000000048a126 in multixact_redo (lsn=..., record=0x800e53000) at
> multixact.c:2017
> #7 0x0000000000484bea in StartupXLOG () at xlog.c:6277
> #8 0x00000000004870a8 in StartupProcessMain () at xlog.c:9335
> #9 0x000000000049cab2 in AuxiliaryProcessMain (argc=2,
> argv=0x7fffffffe450) at bootstrap.c:412
> #10 0x00000000005ba926 in StartChildProcess (type=StartupProcess) at
> postmaster.c:4442
> #11 0x00000000005bccb2 in PostmasterMain (argc=5, argv=0x7fffffffea38) at
> postmaster.c:1093
> #12 0x0000000000566f3b in main (argc=5, argv=0x7fffffffea38) at main.c:188
>
> relevant database log:
> 2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOG: 00000:
> redo starts at 925/2C0004C8
> 2012-08-22 07:47:21.330 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION:
> StartupXLOG, xlog.c:6202
> 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []FATAL:
> XX000: could not access status of transaction 361788644
> 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []DETAIL:
> Could not read from file "pg_multixact/offsets/1590" at offset 114688:
> Unknown error: 0.
> 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []CONTEXT:
> xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
> 2012-08-22 07:47:21.384 MSK 18155 @ from [vxid:1/0 txid:0] []LOCATION:
> SlruReportIOError, slru.c:840
>
>
> Kind Regards,
> Maksym
>

And finally gdb backtrace on build without optimization
( ./configure --prefix=/usr/local/pgsql/tmp/postgresql/ --enable-debug
--enable-cassert CFLAGS='-O0' ):

#0 0x0000000800d42ab0 in exit () from /lib/libc.so.7
#1 0x00000000006aafc6 in proc_exit (code=1) at ipc.c:137
#2 0x00000000007bcbb1 in errfinish (dummy=0) at elog.c:501
#3 0x00000000004b7065 in SlruReportIOError (ctl=0xad9740, pageno=176654,
xid=361788644) at slru.c:836
#4 0x00000000004b6353 in SimpleLruReadPage (ctl=0xad9740, pageno=176654,
write_ok=1 '\001', xid=361788644) at slru.c:427
#5 0x00000000004b85dd in RecordNewMultiXact (multi=361788644,
offset=822580965, nxids=2, xids=0x800f5302c) at multixact.c:764
#6 0x00000000004b9dd1 in multixact_redo (lsn=..., record=0x800f53000) at
multixact.c:2017
#7 0x00000000004aef0c in StartupXLOG () at xlog.c:6277
#8 0x00000000004b446d in StartupProcessMain () at xlog.c:9335
#9 0x00000000004d9e36 in AuxiliaryProcessMain (argc=2,
argv=0x7fffffffe8b0) at bootstrap.c:412
#10 0x0000000000683398 in StartChildProcess (type=StartupProcess) at
postmaster.c:4442
#11 0x000000000067ef60 in PostmasterMain (argc=5, argv=0x7fffffffea38) at
postmaster.c:1093
#12 0x00000000005fbf5a in main (argc=5, argv=0x7fffffffea38) at main.c:188

It seems failing around function SlruPhysicalReadPage at block

errno = 0;
if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
{
slru_errcause = SLRU_READ_FAILED;
slru_errno = errno;
close(fd);
return false;
}

It seems that it read less than BLCKSZ bytes. And it is not an error
condition for read call
(from man 2 read:
Note that reading less than size bytes is not an error.
).

Actual size of pg_multixact/offsets/1590 file is:
[pgsql(at)db1 ~/tmp/postgresql-9.0.8]$ ls -la
/db/data_back/pg_multixact/offsets/1590
-rw------- 1 pgsql pgsql 81920 Aug 21 07:46
/db/data_back/pg_multixact/offsets/1590

Now it seems that the (lseek(fd, (off_t) offset, SEEK_SET) < 0) call moving
seek position to outside of an actual file size.
And after that, read call actually could read only 0 bytes from it.

Does that mean that file was damaged during rsync?

Kind Regards,
Maksym

What next I should examine?

Kind Regards,
Maksym

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2012-08-22 09:33:35 Re: BUG #7500: hot-standby replica crash after an initial rsync
Previous Message Maxim Boguk 2012-08-22 03:50:09 Re: BUG #7500: hot-standby replica crash after an initial rsync