Re: Error with index on unlogged table

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Error with index on unlogged table
Date: 2015-12-08 13:09:54
Message-ID: 20151208130954.GT4934@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2015-12-04 21:57:54 +0900, Michael Paquier wrote:
> On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> Let's go for XLOG_FPI_FLUSH.
> >
> > I think the other way is a bit better, because we can add new flags
> > without changing the WAL format.
>
> Hm. On the contrary, I think that it would make more sense to have a
> flag as well for FOR_HINT honestly, those are really the same
> operations, and FOR_HINT is just here for statistic purposes.

I don't think it's all that much the same operation. And WRT statistics
purpose: Being able to easily differentiate FOR_HINT is important for
pg_xlogdump --stats, but not at all for XLOG_FPI_FLUSH.

> >> --- a/src/backend/access/transam/xloginsert.c
> >> +++ b/src/backend/access/transam/xloginsert.c
> >> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
> >> * If the page follows the standard page layout, with a PageHeader and unused
> >> * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows
> >> * the unused space to be left out from the WAL record, making it smaller.
> >> + *
> >> + * If 'is_flush' is set to TRUE, relation will be requested to flush
> >> + * immediately its data at replay after replaying this full page image.
> >> */
> >
> > s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the
> > OS immediately after replaying the record'?
>
> s/OS/stable storage?

It's not flushed to stable storage here. Just to the OS.

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index 99337b0..fff48ab 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS)
> brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
> BRIN_CURRENT_VERSION);
> MarkBufferDirty(metabuf);
> - log_newpage_buffer(metabuf, false);
> +
> + /*
> + * For unlogged relations, this page should be immediately flushed
> + * to disk after being replayed. This is necessary to ensure that the
> + * initial on-disk state of unlogged relations is preserved as the
> + * on-disk files are copied directly at the end of recovery.
> + */
> + log_newpage_buffer(metabuf, false,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
> END_CRIT_SECTION();
>
> UnlockReleaseBuffer(metabuf);
> diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
> index f876f62..572fe20 100644
> --- a/src/backend/access/brin/brin_pageops.c
> +++ b/src/backend/access/brin/brin_pageops.c
> @@ -865,7 +865,7 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
> page = BufferGetPage(buffer);
> brin_page_init(page, BRIN_PAGETYPE_REGULAR);
> MarkBufferDirty(buffer);
> - log_newpage_buffer(buffer, true);
> + log_newpage_buffer(buffer, true, false);
> END_CRIT_SECTION();
>
> /*
> diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
> index 49e9185..17c168a 100644
> --- a/src/backend/access/gin/gininsert.c
> +++ b/src/backend/access/gin/gininsert.c
> @@ -450,14 +450,22 @@ ginbuildempty(PG_FUNCTION_ARGS)
> ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
> LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
>
> - /* Initialize and xlog metabuffer and root buffer. */
> + /*
> + * Initialize and xlog metabuffer and root buffer. For unlogged
> + * relations, those pages need to be immediately flushed to disk
> + * after being replayed. This is necessary to ensure that the
> + * initial on-disk state of unlogged relations is preserved when
> + * they get reset at the end of recovery.
> + */
> START_CRIT_SECTION();
> GinInitMetabuffer(MetaBuffer);
> MarkBufferDirty(MetaBuffer);
> - log_newpage_buffer(MetaBuffer, false);
> + log_newpage_buffer(MetaBuffer, false,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
> GinInitBuffer(RootBuffer, GIN_LEAF);
> MarkBufferDirty(RootBuffer);
> - log_newpage_buffer(RootBuffer, false);
> + log_newpage_buffer(RootBuffer, false,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
> END_CRIT_SECTION();
>
> /* Unlock and release the buffers. */
> diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
> index 53bccf6..6a20031 100644
> --- a/src/backend/access/gist/gist.c
> +++ b/src/backend/access/gist/gist.c
> @@ -84,7 +84,8 @@ gistbuildempty(PG_FUNCTION_ARGS)
> START_CRIT_SECTION();
> GISTInitBuffer(buffer, F_LEAF);
> MarkBufferDirty(buffer);
> - log_newpage_buffer(buffer, true);
> + log_newpage_buffer(buffer, true,
> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
> END_CRIT_SECTION();
>

I might be missing something here - but isn't it pretty much guaranteed
that all these are unlogged relations? Otherwise *buildempty() wouldn't
have been called, right?

> else if (info == XLOG_BACKUP_END)
> {
> @@ -178,9 +183,6 @@ xlog_identify(uint8 info)
> case XLOG_FPI:
> id = "FPI";
> break;
> - case XLOG_FPI_FOR_HINT:
> - id = "FPI_FOR_HINT";
> - break;
> }

Really don't want to do that.

> @@ -9391,14 +9394,34 @@ xlog_redo(XLogReaderState *record)
> * resource manager needs to generate conflicts, it has to define a
> * separate WAL record type and redo routine.
> *
> - * XLOG_FPI_FOR_HINT records are generated when a page needs to be
> - * WAL- logged because of a hint bit update. They are only generated
> - * when checksums are enabled. There is no difference in handling
> - * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
> - * code just to distinguish them for statistics purposes.
> + * Records flagged with 'for_hint_bits' are generated when a page needs
> + * to be WAL- logged because of a hint bit update. They are only
> + * generated when checksums are enabled. There is no difference in
> + * handling records when this flag is set, it is used for statistics
> + * purposes.
> + *
> + * Records flagged with 'is_flush' indicate that the page immediately
> + * needs to be written to disk, not just to shared buffers. This is
> + * important if the on-disk state is to be the authoritative, not the
> + * state in shared buffers. E.g. because on-disk files may later be
> + * copied directly.
> */
> if (XLogReadBufferForRedo(record, 0, &buffer) != BLK_RESTORED)
> elog(ERROR, "unexpected XLogReadBufferForRedo result when restoring backup block");
> +
> + if (xlrec.is_flush)
> + {
> + RelFileNode rnode;
> + ForkNumber forknum;
> + BlockNumber blkno;
> + SMgrRelation srel;
> +
> + (void) XLogRecGetBlockTag(record, 0, &rnode, &forknum, &blkno);
> + srel = smgropen(rnode, InvalidBackendId);
> + smgrwrite(srel, forknum, blkno, BufferGetPage(buffer), false);
> + smgrclose(srel);
> + }

That'd leave the dirty bit set on the buffer...

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2015-12-08 13:32:31 Tab-comletion for RLS
Previous Message Andres Freund 2015-12-08 12:57:16 Re: Error with index on unlogged table