could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

From: Антон Степаненко <zlobnynigga(at)yandex(dot)ru>
To: pgsql-bugs(at)postgresql(dot)org
Subject: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes
Date: 2011-06-16 20:10:01
Message-ID: 1208551308255001@web137.yandex.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Greetings.

First and foremost - sorry for my bad english.
I have PostgreSQL 9.0.4 installation using streaming replication, 1 master and 10 replicas. I've migrated to it about a month ago, from 8.3 and slony. Built-in replication is wonderful, slony sucks =), but I have some performance issues. In order to manage with them I've tried lots of different things. One of them is rising shared_burffers from 8Gb to 12Gb (I have 24Gb at all). After postgres restart he crahed in about 3 hours with messages:

[2301-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 34691 in file "base/17931/407169": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2390-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 2242 in file "base/17931/18984": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT:[some statement]
[2302-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 34691 in file "base/17931/407169": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2391-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2392-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2393-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2394-1] 2011-06-16 17:40:26 UTC ERROR: could not read block 25578 in file "base/17931/686571": Bad address
[2394-2] 2011-06-16 17:40:27 UTC STATEMENT: [some statement]
[4-1] 2011-06-16 17:40:27 UTC LOG: startup process (PID 15292) was terminated by signal 7: Bus error
[5-1] 2011-06-16 17:40:27 UTC LOG: terminating any other active server processes
[858-1] 2011-06-16 17:40:27 UTC WARNING: terminating connection because of crash of another server process
[858-2] 2011-06-16 17:40:27 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2429-1] 2011-06-16 17:40:27 UTC WARNING: terminating connection because of crash of another server process
[858-3] 2011-06-16 17:40:27 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.

Despite "In a moment you should be able to reconnect to the database and repeat your command" postgres did nor restart by itself.
Signal 7 means hardware problems. But all 10 replicas crashed within 10 minutes, say from 13:35 to 13:45. And when shared buffers were set to 8Gb I hadn't experienced such troubles.
I checked /proc/sys/kernel/shmmax - 16Gb, /proc/sys/kernel/shmall - 4194304.
I tried to set vm.overcommit_memory=2 and vm.overcommit_ratio=90 - no sense. Then I tried vm.overcommit_memory=1 - no sense.
Here are vmstat logs (vmstat 10 > vmstat.log):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 5308644 0 0 0 0 508 133 0 743 2 1 97 0
0 0 0 5279436 0 0 0 0 620 307 0 1016 3 0 97 0
1 0 0 5266384 0 0 0 0 242 223 0 1104 8 1 92 0
1 0 0 5260544 0 0 0 0 359 287 0 959 2 0 98 0
0 0 0 5245764 0 0 0 0 877 311 0 891 3 0 97 0
0 0 0 24378380 0 0 0 0 454 314 0 1011 2 7 91 0
0 0 0 24378380 0 0 0 0 0 69 0 118 0 0 100 0
0 0 0 24378380 0 0 0 0 0 2 0 99 0 0 100 0
0 0 0 24378380 0 0 0 0 0 1 0 180 0 0 100 0
0 0 0 24378024 0 0 0 0 0 1 0 116 0 0 100 0
0 0 0 24378024 0 0 0 0 0 342 0 126 0 0 100 0

So you can see that postgres dies when there are about 5Gb of free memory. In one of my experiments postgres worked for about 30 minutes with 0kb free memory. So I think problem is not memory lack.
I thought about bad indexes and reindex. In messages like "could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes" YYYYY is my database and ZZZZZ somehow relates to a table or to an index. It may be relfilenode of an index or oid of a table. And indexes and tables can be of any type, there are no regularity. I haven't tried to reindex, because after all 10 replicas went down. I made all experiments on only one of them, while reindexing will affect all installation, which is in production and under high load.
One important thing - all replicas and master are running on openvz ubuntu lucid 2.6.32, and there is no way to reject virtualization (it is a long story =))

Please, I do not want to discuss my decision to set buffers to 12Gb and postgresql optimization at all. I just want to undestand why I'm getting such errors.
I will appreciate any help, thank you in advance.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kevin Grittner 2011-06-16 20:28:48 Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes
Previous Message Robert Haas 2011-06-16 19:52:16 Re: BUG #6039: Trying to use server status