Re: new heapcheck contrib module

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: new heapcheck contrib module
Date: 2020-05-14 18:32:53
Message-ID: CA+TgmoYTDcf5MJrSBCSB6iLnGzh4pE7nCBBVBYGP-7D0CwzuHw@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 13, 2020 at 5:33 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Do you recall seeing corruption resulting in segfaults in production?

I have seen that, I believe. I think it's more common to fail with
errors about not being able to palloc more than 1GB, not being able to
look up an xid or mxid, etc., but I am pretty sure I've seen multiple
cases involving seg faults, too. Unfortunately for my credibility, I
can't remember the details right now.

> I personally don't recall seeing that. If it happened, the segfaults
> themselves probably wouldn't be the main concern.

I don't really agree. Hypothetically speaking, suppose you corrupt
your only copy of a critical table in such a way that every time you
select from it, the system seg faults. A user in this situation might
ask questions like:

1. How did my table get corrupted?
2. Why do I only have one copy of it?
3. How do I retrieve the non-corrupted portion of my data from that
table and get back up and running?

In the grand scheme of things, #1 and #2 are the most important
questions, but when something like this actually happens, #3 tends to
be the most urgent question, and it's a lot harder to get the
uncorrupted data out if the system keeps crashing.

Also, a seg fault tends to lead customers to think that the database
has a bug, rather than that the database is corrupted.

Slightly off-topic here, but I think our error reporting in this area
is pretty lame. I've learned over the years that when a customer
reports that they get a complaint about a too-large memory allocation
every time they access a table, they've probably got a corrupted
varlena header. However, that's extremely non-obvious to a typical
user. We should try to report errors indicative of corruption in a way
that gives the user some clue that corruption has happened. Peter made
a stab at improving things there by adding
errcode(ERRCODE_DATA_CORRUPTED) in a bunch of places, but a lot of
users will never see the error code, only the message, and a lot of
corruption still produces errors that weren't changed by that
commit.
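
To make the point above concrete, here is a minimal sketch of a
message that hints at corruption when a length read from disk is
implausibly large. It is standalone C, not PostgreSQL code: the
function sketch_check_stored_length, the macro SKETCH_MAX_ALLOC_SIZE,
and the message wording are all hypothetical, and the limit is only
assumed to mirror the 1GB allocation cap mentioned earlier.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: models a 1GB-1 allocation cap; the name is hypothetical. */
#define SKETCH_MAX_ALLOC_SIZE ((uint64_t) 0x3fffffff)

/*
 * Hypothetical helper: validate a length that was decoded from stored
 * data before it is used as an allocation size.
 *
 * Returns 0 and empties buf if the length is acceptable.  Returns 1 and
 * writes a message into buf if the length exceeds the cap.  The message
 * names the relation and says corruption is a likely cause, instead of
 * only complaining about the allocation size.
 */
static int
sketch_check_stored_length(uint64_t stored_len, const char *relname,
                           char *buf, size_t buflen)
{
    if (stored_len <= SKETCH_MAX_ALLOC_SIZE)
    {
        if (buflen > 0)
            buf[0] = '\0';
        return 0;
    }

    snprintf(buf, buflen,
             "invalid length %llu read from relation \"%s\"; "
             "the stored data may be corrupted",
             (unsigned long long) stored_len, relname);
    return 1;
}
```

A caller would run this check before allocating and surface buf to
the user, so the hint about corruption is visible even when the error
code is not.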

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
