Re: hash index on unlogged tables doesn't behave as expected

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: amit(dot)kapila16(at)gmail(dot)com
Cc: robertmhaas(at)gmail(dot)com, michael(dot)paquier(at)gmail(dot)com, ashu(dot)coek88(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: hash index on unlogged tables doesn't behave as expected
Date: 2017-07-10 09:57:40
Message-ID: 20170710.185740.203923467.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in <CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw(at)mail(dot)gmail(dot)com>
> On Mon, Jul 10, 2017 at 10:39 AM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > At Sat, 8 Jul 2017 16:41:27 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in <CAA4eK1+-DUto+MyeNdLE9P9u8G3Fv6n+SOjPSqmPSw6ashhPjw(at)mail(dot)gmail(dot)com>
> >> On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> > On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >> >> I think we should do that as a separate patch (I can write the same as
> >> >> well) because that is not new behavior introduced by this patch, ...
> >> >
> >> > True, although maybe hash indexes are the only way that happens today?
> >> >
> >>
> >> No, I think it will happen for all other indexes as well. Basically,
> >> we log new page for init forks of other indexes and then while
> >> restoring that full page image, it will fall through the same path.
> >
> > (Sorry, I didn't noticed that hash metapage is not using log_newpgae)
> >
>
> The bitmappage is also not using log_newpage.
>
> > For example, (bt|gin|gist|spgist|brin)buildempty are using
> > log_newpage to log INIT_FORK metapages. I believe that the xlog
> > flow from log_newpage to XLogReadBufferForRedoExtended is
> > developed so that pages in init forks are surely flushed during
> > recovery, so we should fix it instead adding other flushes if the
> > path doesn't work. Am I looking wrong place? (I think it is
> > working.)
> >
>
> You are looking at right place.
>
> > If I'm understanding correctly here, hashbild and hashbuildempty
> > should be refactored using the same structure with other *build
> > and *buildempty functions. Specifically metapages initialization
> > subroutines (_hash_init or _bt_initmetapage or SpGistInitMetapage
> > or...) does only on-memory initialization and does not emit WAL,
> > then *build and *buildempty emits WAL in their required way.
> >
>
> I have considered this way of doing as well, read my first e-mail [1]
> in this thread "Another approach to fix the issue ....". It is not
> that we can't change the code of hashbuildempty to make it log new
> pages for all kind of pages (metapage, bitmappage and datapages), but
> I am not sure if there is a value in going in that direction. If we
> have to go in that direction, we need to hard code some of the values
> like hashm_nmaps and hashm_mapp in metapage rather than initializing
> them after bitmappage creation. Before going in that direction, I
> think we should also take opinion from other people especially some
> committer as we might need to maintain two different ways of doing
> almost the same thing.

Thanks for the explanation and the pointer (to the start of this
thread.. sorry).

> I am also not 100% comfortable with adding flush at two new places,
> but that makes the code for fix simpler and fundamentally there is no
> problem in doing so.

I agree that the patch would be simpler. Ok, I am satisfied with
an additional comment for _hash_init and hash_xlog_init_meta_page
that describes the reason that _hash_init doesn't/can't use
log_newpage and thus requires explicit flushes. Something like
the description in [1] would be enough.

> There is a small downside to always logging new page which is that it
> will always log full page image in WAL which has the potential to
> increase WAL volume if there are many unlogged tables and indexes.
> Now considering the init fork generally has very less pages, it is not
> a big concern, but still avoiding full page image has some value.

Perhaps more effective mechanism will be developed at the time.

> >> >>> By the way the comment of the function ReadBufferWithoutRelcache
> >> >>> has the following sentense.
> >> >>>
> >> >>> | * NB: At present, this function may only be used on permanent relations, which
> >> >>> | * is OK, because we only use it during XLOG replay. If in the future we
> >> >>> | * want to use it on temporary or unlogged relations, we could pass additional
> >> >>> | * parameters.
> >> >>>
> >> >>> and does
> >> >>>
> >> >>> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
> >> >>> mode, strategy, &hit);
> >> >>>
> >> >>> This surely works since BufferAlloc recognizes INIT_FORK. But
> >> >>> some adjustment may be needed around here.
> >> >>
> >> >> Yeah, it should probably mention that the init fork of an unlogged
> >> >> relation is also OK.
> >> >>
> >> >
> >> > I think we should do that as a separate patch (I can write the same as
> >> > well) because that is not new behavior introduced by this patch, but
> >> > let me know if you think that we should add such a comment in this
> >> > patch itself.
> >> >
> >>
> >> Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.
> >
> > + * NB: At present, this function may only be used on permanent relations and
> > + * init fork of an unlogged relation, which is OK, because we only use it
> > + * during XLOG replay. If in the future we want to use it on temporary or
> > + * unlogged relations, we could pass additional parameters.
> >
> > *I* think the target of the funcion is permanent relations and
> > init forks, not unlogged relations. And I'd like to have an
> > additional comment about RELPERSISTENCE_PERMANENT. Something like
> > the attached.
> >
>
> Okay, let's leave this for committer to decide.

Agreed, thanks.

> [1] - https://www.postgresql.org/message-id/CAA4eK1JpcMsEtOL_J7WODumeEfyrPi7FPYHeVdS7fyyrCrgp4w%40mail.gmail.com

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Gierth 2017-07-10 10:10:12 Re: COPY vs. transition tables
Previous Message Thomas Munro 2017-07-10 09:39:09 Re: POC: Sharing record typmods between backends