Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

From: Антон Степаненко <zlobnynigga(at)yandex(dot)ru>
To: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes
Date: 2011-06-17 13:51:00
Message-ID: 438511308318660@web144.yandex.ru
Lists: pgsql-bugs

17.06.2011, 00:28, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>:
> Антон Степаненко <zlobnynigga(at)yandex(dot)ru> wrote:
>
>>  [4-1] 2011-06-16 17:40:27 UTC LOG:  startup process (PID 15292)
>>  was terminated by signal 7: Bus error
>>  Signal 7 means  hardware problems. But all 10 replicas crashed
>>  within 10 minutes, say from 13:35 to 13:45.
>>  One important thing - all replicas and master are running on
>>  openvz
>
> Were the PostgreSQL clusters sharing any hardware?
>
>>  there is no way to reject virtualization (it is a long story =))
>>
>>  Please, I do not want to discuss my decision to set buffers to
>>  12Gb and postgresql optimization at all. I just want to understand
>>  why I'm getting such errors.
>
> On the face of it, the most likely cause would seem to be hardware
> or the virtual environment.  Without knowing more about the exact
> messages on the replicas and how they compared to each other and the
> master it's hard to know whether any of the replica failures were
> from passing corrupted data from the master to the replicas, versus
> having a common hardware/vm flaw.
>
> -Kevin

I noticed that the crash happens when shared buffers are almost full: SELECT SUM(size) FROM adm.buffercache() returned 11670 about one minute before the crash. Furthermore, last night I set shared buffers to 11 GB, and it is working: no crash, and all buffers are in use (11120).
I still do not believe that this is a hardware problem. Each replica and the master runs on its own dedicated server; no hardware is shared. Only PostgreSQL runs on each server, with no other software apart from crond, zabbix, and atop.
Actually, openvz is used only for portability (to easily add new replicas or migrate one of them to a new server).
The messages on the replicas are all the same: "could not read block", then "signal 7". I copy-pasted the error log as is; that is all I know.
The master did not crash, I think because it processes fewer SELECT queries, so its buffers do not reach the limit.
