Re: Error with index on unlogged table

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: robertmhaas(at)gmail(dot)com, thom(at)linux(dot)com, andres(at)2ndquadrant(dot)com, fabriziomello(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error with index on unlogged table
Date: 2015-12-01 06:06:42
Message-ID: 20151201.150642.258467659.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

At Tue, 1 Dec 2015 11:53:35 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSMENEK7nqmwGsiyLTSbrZNJKx80tBX3qF6cQsS49sjag(at)mail(dot)gmail(dot)com>
> On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > Hello, I studied your latest patch.
>
> Thanks!
>
> > I feel quite uncomfortable that it solves the problem from a kind
> > of nature of unlogged object by arbitrary flagging which is not
> > fully corresponds to the nature. If we can deduce the necessity
> > of fsync from some nature, it would be preferable.
>
> INIT_FORKNUM is not something only related to unlogged relations,
> indexes use them as well. And that's actually
> If you look at for example BRIN indexes that do not sync immediately
> their INIT_FORKNUM when index is created, I think that we still are
> going to need a new flag to control the sync at WAL replay because
> startup process cannot know a relation's persistence, thing that we
> can know when the XLOG_FPI record is created. For BRIN indexes, we
> want particularly to not sync the INIT_FORKNUM when the relation is
> not an unlogged one.

(The comment added in brinbuildempty looks wrong since it
actually doesn't fsync it immediately..)

Hmm, I've already seen that, and having your explanation I wonder
why brinbuidempty issues WAL for what is not necessary to be
persistent at the mement. Isn't it breaking agreements about
Write Ahead Log? INIT_FORKNUM and unconditionally fsync'ing would
be equally tied excluding the anormally about WAL. (Except for
succeeding newpages.)

> > In the current patch, is_sync for log_newpage is generally true
> > for and only for INIT_FORKNUM pages. Exceptions as far as I can
> > see are,
>
> Yep, that's not always the case.
>
> > copy_relation_data: called with arbitrary forknum but it doesn't
> > set is_fsync even for coying INIT_FORKNUM. (Is this not a
> > problem?)
>
> Oh. And actually, it seems that use_wal is broken there as well. I
> think that we would still want to issue a XLOG_FPI record for an
> unlogged relation should it be moved to a new tablespace to copy its
> INIT_FORKNUM correctly to its new home.

Agreed.

> After moving an unlogged relation to a new tablespace, and after
> promoting a standby the standby fails to start because of this error:
> FATAL: could not create file
> "pg_tblspc/16388/PG_9.6_201511071/16384/16389": File exists
> This could be fixed separately though.
>
> > spgbuildempty, ginbuildempty: these emits two or three newpage
> > logs at once so only the last one is set is_fsync for
> > performance reason.
>
> It doesn't matter to fsync just at the last one. Each one of them
> would be replayed either way, the last one triggering the sync, no?

Yes, it does (except for some unreal case below). It was just a
confirmation and I didn't see this as a problem at all. Sorry for the
ambiguous writing.

> > In short, it seems to me that the reason to choose using
> > XLOG_FPI_FOR_SYNC here is only performance of processing
> > successive FPIs for INIT_FORKNUM.
>
> Yeah, there is a one-way link between this WAL record a INIT_FORKNUM.
> However please note that having a INIT_FORKNUM does not always imply
> that a sync is wanted. copy_relation_data is an example of that.

As I wrote above, I suppose we should fix(?) the irregular
relationship between WAL and init fork of brin and so.

> > INIT_FORKNUM is generated only for unlogged tables and their
> > belongings. I suppose such successive fsyncs doesn't cause
> > observable performance drop assuming that the number of unlogged
> > tables and belongings is not so high, especially with smarter
> > storages. All we should do is that just fsync only for
> > INIT_FORKNUM's FPIs for the case. If the performance does matter
> > even so, we still can fsync the last md-file when any wal record
> > other than FPI for INIT_FORK comes. (But this would be a bit
> > complex..)
>
> Hm. If a system uses a bunch of temporary relations with brin index or
> other included I would not say so. For back branches we may have to do
> it unconditionally using INIT_FORKNUM, but having a control flag to
> have it only done for unlogged relations would leverage that.

It could, and should do so. And if we take such systems with
bunch of temp relations as significant (I agree with this),
XLogRegisterBlock() looks to be able to register multiple blocks
into single wal record and we could eliminate arbitrary flagging
on individual FPI records using it. Is it possible?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2015-12-01 06:22:16 Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Previous Message Amit Langote 2015-12-01 04:28:55 Re: Minor comment edits in nodeGather.c