Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
Date: 2018-08-09 01:42:56
Message-ID: CAH2-WzmfC3WuVKCvBtS1agKs5kRu8uKnp_+VReYgD8D1XD0v5A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
>> but I don't immediately know where.
>
> Hm, there was:
> http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de
>
>
> I don't immediately see it being responsible, but I wonder if there's a
> chance it actually is: Note that it happens in a parallel group that
> includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
> don't immediately see how it could apply.

It's now pretty clear that it was not that particular bug, since I
pushed a fix, and yet the issue hasn't gone away on affected buildfarm
animals. There was a recurrence of the problem on lapwing, for example
[1].

Anyway, "VACUUM FULL pg_class" should be expected to corrupt
pg_class_oid_index when we happen to get a parallel build, since
pg_class is a mapped relation, and I've identified that as a problem
for parallel CREATE INDEX [2]. If that was the ultimate cause of the
issue, it would explain why only REL_11_STABLE and master are
involved.

My guess is that the metapage considers the root page to be at block 3
(block 3 is often the root page for small though not tiny B-Trees),
which for whatever reason is where we get a short read. I don't know
why there is a short read, but corrupting mapped catalog indexes at
random can be expected to cause all kinds of chaos, so that doesn't
mean much.
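
For readers unfamiliar with the error text: a read of 0 bytes is what you get when the requested block starts at or past the end of the file. The sketch below only illustrates that arithmetic under stated assumptions. It is not PostgreSQL code, the helper names (read_block, demo) are hypothetical, and it makes no claim about why the file was short in this case. It assumes the default 8192-byte block size and a stand-in file holding only blocks 0 through 2.

```python
import os
import tempfile

BLCKSZ = 8192  # default PostgreSQL block size, matching the error message


def read_block(path, blkno):
    """Read one block at offset blkno * BLCKSZ; raise on a short read."""
    with open(path, "rb") as f:
        f.seek(blkno * BLCKSZ)
        data = f.read(BLCKSZ)
    if len(data) != BLCKSZ:
        raise IOError(
            f'could not read block {blkno} in file "{path}": '
            f"read only {len(data)} of {BLCKSZ} bytes"
        )
    return data


def demo():
    """Build a 3-block stand-in file, then ask for block 3 (hypothetical)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * (3 * BLCKSZ))  # blocks 0, 1 and 2 only
        path = f.name
    try:
        read_block(path, 2)  # the last existing block reads fine
        try:
            read_block(path, 3)  # starts exactly at end of file
        except IOError as e:
            return str(e)
        return None
    finally:
        os.unlink(path)
```

If a metapage names block 3 as the root while the file on disk is only three blocks long, a read like this is the first thing that fails, whatever left the two out of sync.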

In any case, I'll probably push a fix for this other bug on Friday,
barring any objections. It's possible that that will make the problem
go away.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2018-08-04%2004%3A20%3A01
[2] https://www.postgresql.org/message-id/CAH2-Wzn=j0i8rxCAo6E=tBO9XuYXb8HbUsnW7J_StKON8dDOhQ@mail.gmail.com
--
Peter Geoghegan
