Re: BUG #15745: WAL References Invalid Pages...that eventually resolves

From: Daniel Farina <daniel(at)citusdata(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15745: WAL References Invalid Pages...that eventually resolves
Date: 2019-05-06 03:37:02
Message-ID: CAOPfGFgJfz8R7eWnuE3duF4EvJrQYRboHN73=OMc1L=hyqOvHA@mail.gmail.com
Lists: pgsql-bugs

On Sat, Apr 27, 2019 at 8:28 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> Hi Daniel,
>
> On Tue, Apr 9, 2019 at 1:30 PM PG Bug reporting form
> <noreply(at)postgresql(dot)org> wrote:
> > But, for serendipitous reasons, I let this one run for a while. As it turns
> > out, with each crash, it would make *slightly* more progress than the time
> > before....and then eventually, it suffered no more faults and caught up
> > normally. Included is a log that shows how sparse these faults were,
> > relative to all the traffic going on....: roughly two per segment on this
> > workload, with large gaps between problematic segments, and not necessarily
> > repetition in a problematic relation or filenode.
>
> That sounds weird.
>

Yeah. It was.

> > The fact the standby eventually came up made me suspicious, so I ran amcheck
> > with a heap re-check, and, no tuples were in violation.
> >
> > Included is a log, which shows how the system recovered over and over,
> > making slight progress each time. This is the entire inventory after such
> > crashes: after these, the system passed amcheck and appears to work
> > normally.
>
> Did you try bt_index_parent_check('rel', true)? You might want to make
> sure that work_mem is set sufficiently high so that the
> downlink-block-is-present check is definitely effective; work_mem
> bounds the size of a Bloom filter used by the implementation (the heap
> verification option has its own Bloom filter, bound by
> maintenance_work_mem). Suggest that you "set
> client_min_messages=debug1" before running amcheck this way, just in
> case that shows something interesting.
>

No, I didn't want to take that lock...but I'll keep it in mind for next
time, though I'll have to make arrangements.

> > postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
> > WARNING: page 162136064 of relation base/16385/21372 does not exist
>
> These WARNING messages all reference block numbers that look like
> 32-bits of random garbage, but could be from a very large relation.
>

Definitely not a 1.3 TB relation. Good eye.
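
For reference, the size implied by the block number in that WARNING can be checked with a little arithmetic. This is only a sketch: it assumes the default 8 KiB block size (BLCKSZ = 8192), and the helper name is made up for illustration.

```python
# Assumption: the cluster uses the default PostgreSQL block size of 8192 bytes.
BLCKSZ = 8192


def implied_relation_bytes(block_number: int, block_size: int = BLCKSZ) -> int:
    """Smallest relation size, in bytes, at which block_number would exist.

    Block numbers are zero-based, so block N requires N + 1 blocks.
    """
    return (block_number + 1) * block_size


# Block number taken from the WARNING line quoted above.
size = implied_relation_bytes(162136064)
print(f"{size} bytes = {size / 2**40:.2f} TiB = {size / 10**12:.2f} TB")
```

Under that assumption the relation would need to be about 1.21 TiB (1.33 TB) for the referenced page to exist.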

> The relevant WAL record is from B-Tree's opportunistic LP_DEAD garbage
> collection (not VACUUM). Note that Andres changed this mechanism for
> v12, so that latestRemovedXid was calculated on the primary, rather
> than on the standby. I think that this error comes from
> btree_xlog_delete_get_latestRemovedXid(), which is in 11 but not
> master/12.
>
> I wonder, is "base/16385/21351" the index or the table? Is it possible
> to run pg_waldump? I think it's the table.
>
> If the problem is in btree_xlog_delete_get_latestRemovedXid(), then it
> is perhaps unsurprising that there isn't evidence of any lasting
> corruption.
>

Regrettably, it's too late, but I'll have my notes for next time.
