Accidental removal of a file causing various problems

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Accidental removal of a file causing various problems
Date: 2018-08-24 18:14:38
Message-ID: CABOikdPF6L+z+1YPNPPfEvWNmCHtvsF6RDPv1X_F1wVwstc_Ag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello All,

One of our customers reported a situation where one of the many segments
backing a table went missing. They don't know why that happened and we
couldn't determine if this could be a PostgreSQL bug or FS bug or simply an
operator error.

The very first errors seen in the logs were:

WARNING: could not write block 27094010 of base/56972584/56980980
DETAIL: Multiple failures --- write error might be permanent.
ERROR: could not open file "base/56972584/56980980.69" (target block
27094010): previous segment is only 12641 blocks
CONTEXT: writing block 27094010 of relation base/56972584/56980980

As I said, while we don't know why the file "base/56972584/56980980.69"
went missing, what happens thereafter is also very strange:

1. The user soon found out that they can no longer connect to any database
in the cluster. Not just the one to which the affected table belonged, but
no other database in the cluster. The affected table is a regular user
table (actually a toast table).
2. So they restarted the database server. While that fixed the connection
problem, they started seeing toast errors on the table to which the missing
file belonged to. The missing file was recreated at the database restart,
but of course it was filled in with all zeroes, causing data corruption.
3. To make things worse, the corruption then got propagated to the standbys
too. We don't know if the original file removal was replicated to the
standby, but it seems unlikely.

I've a test case that reproduce all of these effects if a backend file is
forcefully removed, but I haven't had enough time to figure out if 1) the
file removal is a bug in PostgreSQL and 2) whether any of the observed side
effects point to a potential bug.

Is this worth pursuing? Or are these side effects are well understood and
known? IMHO even if we accept that we can't do much about a missing file,
it seems quite odd that both 1 and 3 happens.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-08-24 18:34:00 About those snprintf invocation macros
Previous Message Andres Freund 2018-08-24 18:12:55 Re: Windows vs C99 (was Re: C99 compliance for src/port/snprintf.c)