Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes
Date: 2018-08-30 21:57:12
Message-ID: 20180830215711.GW23024@telsasoft.com
Lists: pgsql-hackers

On Thu, Aug 30, 2018 at 05:30:30PM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby(at)telsasoft(dot)com> writes:
> > On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> >> As far as we can tell, that bug is a dozen years old, so it's not clear
> >> why you find that you can reproduce it only in 10.5. But there might be
> >> some subtle timing change accounting for that.
>
> > It seems to me there's one root problem occurring in (at least) two slightly
> > different ways. The issue/symptom that I've been seeing occurs in 10.5 but not
> > 10.4, and specifically at commit 2ce64ca, but not before.
>
> Yeah, as you probably saw in the other thread, we later realized that
> 2ce64ca created an additional pathway for ScanPgRelation to recurse;
> a pathway that's evidently easier to hit than the pre-existing ones.
> I note that both of your stack traces display ScanPgRelation recursion,
> so I'm feeling pretty confident that what you're seeing is the same
> thing.
>
> But, as Andres says, it'd be great if you could confirm whether the
> draft patches fix it for you.

I tested with relcache-rebuild.diff, which hasn't broken in 15 minutes, so I'm
confident it doesn't hit the additional recursive pathway, but I'll have to wait
a while to see whether autovacuum survives, too.

I tried to apply fix-missed-inval-msg-accepts-1.patch on top of PG10.5, but the
patch didn't apply, so I can test HEAD after the first patch soaks a while.

Just curious, is there really any difficulty in reproducing this? Since I
realized this was a continuing issue and started to suspect pg10.5, it has taken
almost no effort to reproduce anywhere I've tried. I just tested 5 servers,
and only one took more than a handful of seconds to fail. I gave up waiting
for a 6th server, because I found it was waiting on a pre-existing lock.

[pryzbyj(at)database ~]$ while :; do for a in pg_class_oid_index pg_class_relname_nsp_index pg_class_tblspc_relfilenode_index; do psql ts -qc "REINDEX INDEX $a"; done; done&
[pryzbyj(at)database ~]$ a=0; time while psql ts -qc ''; do a=$((1+a)); done ; echo "$a"
psql: FATAL: could not read block 0 in file "base/16400/313581263": read only 0 of 8192 bytes

real 0m1.772s
user 0m0.076s
sys 0m0.116s
47
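
For readability, the two one-liners above can be split into a pair of shell functions. This is only a sketch of the same procedure: the function names and the `DB` variable are made up for illustration, and `ts` is simply the database name used in the commands above.

```shell
#!/bin/sh
# Sketch of the reproducer above. reindex_loop, count_connections and DB are
# illustrative names, not anything from PostgreSQL itself.
DB="${DB:-ts}"

# Rebuild the three pg_class indexes over and over (meant to run in the background).
reindex_loop() {
    while :; do
        for idx in pg_class_oid_index pg_class_relname_nsp_index \
                   pg_class_tblspc_relfilenode_index; do
            psql "$DB" -qc "REINDEX INDEX $idx"
        done
    done
}

# Open new connections until one fails, then print how many succeeded.
count_connections() {
    n=0
    while psql "$DB" -qc ''; do
        n=$((n + 1))
    done
    echo "$n"
}
```

Something like `reindex_loop & count_connections; kill $!` should then behave like the transcript above: the printed number is the count of backend startups that succeeded before the first failure.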

Justin
