database corruption

From: Oliver Stöneberg <oliverst(at)online(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: database corruption
Date: 2016-02-12 09:56:04
Message-ID: 56BDAC34.28370.54ED05E7@oliverst.online.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

We are running a 64-bit PostgreSQL 9.4.5 server on Windows Server
2012. The system is a virtual machine on a VMware ESX 6.0 server and
has 24 GB of memory. The database server is only accessed locally by
two services and there is only a single database in the server. The
disk is located on a storage that is shared with lots of other
servers. The database server has fsync enabled.

A few weeks ago we already had a data corruption when the disk was
full. There are other services running on the same machine that could
cause the disk to fill up (e.g. local chaching when the network is
acting up). It happened a few times so far but the database was never
compromised. In that case thought it was but fortunately we only lost
a huge table/toast (300+ GB) that has very verbose data stored which
is not essential. That happened with an earlier 9.4 version.

Today we encountered another data corruption after the disk was full.
It's much worse this time around since data that is essential for the
applications using it to run. After truncating that 300+ GB table
already mentioned above all the services were restarted and one of
the applications failed to start with the following database error:

Caused by: org.postgresql.util.PSQLException: ERROR: invalid memory alloc request size 18446744073709551613
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:622)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:472)
at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:386)
at joulex.cloud.data.tables.JxObjectStoreJsonTable.getAllJsonsAsHashMap(JxObjectStoreJsonTable.java:215)
... 16 more

Checking the database logs the problems seem to have started two days
ago:
2016-02-10 16:00:34 GMTERROR: invalid page in block 1255 of relation base/16387/3634911224
2016-02-10 16:00:34 GMTCONTEXT: automatic vacuum of table "cloud.t_63d1c1eb806f4e25904b659b75176093.y2016_utilization"

For some reason it doesn't seem to be caused by the disk being full
since the database server was still able to write another 8 GB of log
files for the next two days and the system was still working
yesterday afternoon.
It also doesn't appear to be a hardware problem since all the other
systems sharing the virtual hostsystem and the storage show no issues
at all.

Unfortunately we don't have a recent backup of the database (a tool
to back up all the relevant data was just finished recently and was
not set up for this system yet).

Something else worth noting is that we disabled the autovacuum on the
toast table of the 300+ GB table since we perform INSERT INTO on that
tbale and the vacuum on the table was causing a performance hit. The
autovacuum for it is still being performed to prevent wraparound from
time to and that autovacuum was still running after the machine run
out of disk space and the services was restarted.

Any help in recovering the data is appreciated and if there is more
information necessary on this I will try to provide it. Thanks in
advance.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message bigkev 2016-02-12 10:06:41 Re: ERROR: missing FROM-clause entry for table
Previous Message Teodor Sigaev 2016-02-12 09:37:36 Re: Test CMake build