Re: segfault in hot standby for hash indexes

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: segfault in hot standby for hash indexes
Date: 2017-03-21 16:23:06
Message-ID: CAMkU=1zdP6jX_afiMc8yJWqjUS6K0xCxbrNhm2O7QyLmFHRzvg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 21, 2017 at 4:00 AM, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
wrote:

> Hi Jeff,
>
> On Tue, Mar 21, 2017 at 1:54 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Tue, Mar 21, 2017 at 1:28 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> >> Against an unmodified HEAD (17fa3e8), I got a segfault in the hot
> >> standby.
> >>
> >
> > I think I see the problem in hash_xlog_vacuum_get_latestRemovedXid().
> > It seems to me that we are using different block_id for registering
> > the deleted items in xlog XLOG_HASH_VACUUM_ONE_PAGE and then using
> > different block_id for fetching those items in
> > hash_xlog_vacuum_get_latestRemovedXid(). So probably matching those
> > will fix this issue (instead of fetching block number and items from
> > block_id 1, we should use block_id 0).
> >
>
> Thanks for reporting this issue. As Amit said, it is happening due to
> block_id mismatch. Attached is the patch that fixes the same. I
> apologise for such a silly mistake. Please note that I was not able
> to reproduce the issue on my local machine using the test script you
> shared. Could you please check with the attached patch if you are
> still seeing the issue. Thanks in advance.
>

Hi Ashutosh,

I can confirm that the patch fixes the segfaults for me.
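
For anyone following the thread, the block_id mismatch Amit describes can be
illustrated with a small standalone model. This is only a sketch, not the
PostgreSQL code: ToyRecord and the toy_* functions are hypothetical stand-ins
for a WAL record and its register/fetch calls, and the real record layout is
more involved. The point is just that data attached under one block_id is
invisible to a reader that asks for a different one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a WAL record with per-block payload slots. */
#define TOY_MAX_BLOCK_ID 4

typedef struct ToyRecord
{
    const void *data[TOY_MAX_BLOCK_ID];   /* payload per block_id, or NULL */
    size_t      len[TOY_MAX_BLOCK_ID];    /* payload length in bytes */
} ToyRecord;

/* Writer side: attach a payload to the slot for block_id. */
static void
toy_register_block_data(ToyRecord *rec, int block_id,
                        const void *data, size_t len)
{
    assert(block_id >= 0 && block_id < TOY_MAX_BLOCK_ID);
    rec->data[block_id] = data;
    rec->len[block_id] = len;
}

/* Reader side: return the payload for block_id, or NULL if none was set. */
static const void *
toy_get_block_data(const ToyRecord *rec, int block_id, size_t *len)
{
    assert(block_id >= 0 && block_id < TOY_MAX_BLOCK_ID);
    *len = rec->len[block_id];
    return rec->data[block_id];
}

/*
 * Count the deleted-item offsets stored under block_id.
 * Returns -1 when nothing was registered there; replay code that skips
 * this check would dereference a NULL pointer instead.
 */
static int
toy_count_deleted_items(const ToyRecord *rec, int block_id)
{
    size_t      len;
    const unsigned short *offsets = toy_get_block_data(rec, block_id, &len);

    if (offsets == NULL)
        return -1;
    return (int) (len / sizeof(unsigned short));
}
```

In this model, registering the offsets under block_id 0 and then reading
block_id 1 finds nothing, while reading block_id 0 finds all of them, which
matches the shape of the fix (fetch from block_id 0 rather than 1).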

Did you mean that you couldn't reproduce the problem in the first place, or
that you could reproduce it and the patch now fixes it? If the former, I
forgot to mention that you have to wait for the hot standby to reach
consistency and open for connections, and then connect to the standby
("psql -p 9874"), before the segfault will be triggered.

But there are places where hash_xlog_vacuum_get_latestRemovedXid diverges
from btree_xlog_delete_get_latestRemovedXid, and I don't understand the
reasons for the divergence. Is there a reason we dropped the PANIC if we
have not reached consistency? That is a case which should never happen,
but it seems worth preserving the PANIC. And why does this code get
'unused' from XLogRecGetBlockData(record, 0, &len), while the btree code
gets it from xlrec? Is that because the record being replayed is
structured differently between btree and hash, or is there some other
reason?

Thanks,

Jeff
