Re: Error with index on unlogged table

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Error with index on unlogged table
Date: 2015-03-26 17:50:24
Message-ID: 20150326175024.GJ451@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015-03-26 15:13:41 +0100, Andres Freund wrote:
> On 2015-03-26 13:55:22 +0000, Thom Brown wrote:
> > I still, however, have a problem with the separate and original issue of:
> >
> > # insert into utest (thing) values ('moomoo');
> > ERROR: index "utest_pkey" contains unexpected zero page at block 0
> > HINT: Please REINDEX it.
> >
> > I don't see why the user should need to go re-indexing all unlogged tables
> > each time a standby is promoted. The index should just be empty and ready
> > to use.
>
> There's definitely something rather broken here. Investigating.

As far as I can see this has been broken at least since the introduction
of fast promotion. WAL replay will update the init fork in shared
memory, but it'll not be guaranteed to be flushed to disk when the reset
happens. d3586fc8a et al. then also made it possible to hit the issue
without fast promotion.

To hit the issue there may not be a restartpoint (requiring a checkpoint
on the primary) since the creation of the unlogged table.

I think the problem here is that the *primary* makes no such
assumptions. Init forks are logged via stuff like
smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
(char *) metapage, true);
if (XLogIsNeeded())
log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
BTREE_METAPAGE, metapage, false);

/*
* An immediate sync is required even if we xlog'd the page, because the
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
smgrimmedsync(index->rd_smgr, INIT_FORKNUM);

i.e. the data is written out directly to disk, circumventing
shared_buffers. It's pretty bad that we don't do the same on the
standby. For master I think we should just add a bit to the XLOG_FPI
record saying the data should be forced out to disk. I'm less sure
what's to be done in the back branches. Flushing every HEAP_NEWPAGE
record isn't really an option.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2015-03-26 18:00:20 Re: INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
Previous Message Heikki Linnakangas 2015-03-26 17:16:21 Re: Index-only scans for GiST.