Re: hash index on unlogged tables doesn't behave as expected

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: amit(dot)kapila16(at)gmail(dot)com
Cc: michael(dot)paquier(at)gmail(dot)com, ashu(dot)coek88(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: hash index on unlogged tables doesn't behave as expected
Date: 2017-07-06 09:54:47
Message-ID: 20170706.185447.256482539.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote in <CAA4eK1+SYqCmA7ioTBpJHcO-B-rf8A=N9Gr1-RP3RhwecB5E-A(at)mail(dot)gmail(dot)com>
> On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> > On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > It seems to me that this qualifies as an open item for 10. WAL-logging
> > is new for hash tables. Amit, could you add one?
> >
>
> Added.
>
> > + use_wal = RelationNeedsWAL(rel) ||
> > + (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> > + forkNum == INIT_FORKNUM);
> > In access AMs, empty() routines are always only called for unlogged
> > relations, so you could relax that to check for INIT_FORKNUM only.
> >
>
> Yeah, but _hash_init() is an exposed API, so I am not sure it is a
> good idea to make such an assumption at that level of code. Now, as
> there is no current user which wants to use it any other way, we can
> make such an assumption, but does it serve any purpose?

Check for INIT_FORKNUM appears both accompanied and not
accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
is the convention here.

The difference of the two is an init fork of TEMP
relations. However I believe that init fork is by definition a
part of an unlogged relation, it seems to me that it ought not to
be wal-logged if we had it. From this viewpoint, the middle line
makes sense.

> > Looking at the patch, I think that you are right to do the logging
> > directly in _hash_init() and not separately in hashbuildempty().
> > Compared to other relations the init data is more dynamic.
> >
> > + /*
> > + * Force the on-disk state of init forks to always be in sync with the
> > + * state in shared buffers. See XLogReadBufferForRedoExtended.
> > + */
> > + XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
> > + if (forknum == INIT_FORKNUM)
> > + FlushOneBuffer(metabuf);
> > I find those forced syncs a bit surprising. The buffer is already
> > marked as dirty, so even if there is a concurrent checkpoint they
> > would be flushed.
> >
>
> What if there is no concurrent checkpoint activity and the server is
> shutdown? Remember this replay happens on standby and we will just
> try to perform Restartpoint and there is no guarantee that it will
> flush the buffers. If the buffers are not flushed at shutdown, then
> the Init fork will be empty on restart and the hash index will be
> corrupt.

XLogReadBufferForRedoExtended() automatically force the flush for
a XLOG record on init fork that having FPW imaeges. And
_hash_init seems to have emitted such a XLOG record using
log_newpage.

> > If recovery comes again here, then they would just
> > be replayed. I am unclear why XLogReadBufferForRedoExtended is not
> > enough to replay a FPW of that.
> >
>
> Where does FPW come into the picture for a replay of metapage or bitmappage?

Since required FPW seems to be emitted and the flush should be
done automatically, the issuer side (_hash_init) may attach wrong
FPW images if hash_xlog_init_meta_page requires additional
flushes. But I don't see such a flaw there. I think Michael wants
to say such a kind of thing.

By the way the comment of the function ReadBufferWithoutRelcache
has the following sentense.

| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay. If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.

and does

| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode, strategy, &hit);

This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2017-07-06 10:08:55 Re: hash index on unlogged tables doesn't behave as expected
Previous Message tushar 2017-07-06 09:51:41 Re: increasing the default WAL segment size