Re: WAL format and API changes (9.5)

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WAL format and API changes (9.5)
Date: 2014-07-02 06:01:02
Message-ID: CAB7nPqToR=cHmxMf9qPsW8_d8MjDXiz1yaoFf5X_Z7+Ppvm_Dw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 1, 2014 at 7:18 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> > Thanks for the review!
> >
> >
> > On 06/27/2014 09:12 AM, Michael Paquier wrote:
> >>
> >> Looking at this patch, it does not compile when assertions are enabled
> >> because of this assertion in heapam.c:
> >> Assert(recdata == data + datalen);
> >> It seems like a thinko as removing it makes the code compile.
> >
> >
> > Fixed.
> >
> >
> >> Some comments about the code:
> >> 1) In src/backend/access/transam/README:
> >> 's/no further action is no required/no further action is required/g'
> >
> >
> > Fixed.
> >
> >
> >> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
> >> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
> >
> >
> > Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> > what to do with the latter two. They're not really intended for use by
> redo
> > routines, but for things like pg_xlogdump that want to extract
> information
> > about any record type.
> >
> >
> >> 3) There are a couple of code blocks marked as FIXME and BROKEN. There
> are
> >> as well some NOT_USED blocks.
> >
> >
> > Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
> mostly
> > stand for code that used to print block numbers or relfilenodes, which
> > doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> > routines anymore. Will need to go through them all an decide what's the
> > important information to print for each record type.
> >
> > The few NOT_USED blocks are around code in redo routines that extract
> some
> > information from the WAL record, that isn't needed anymore. We could
> remove
> > the fields from the WAL records altogether, but might be useful to keep
> for
> > debugging purposes.
> >
> >
> >> 4) Current patch crashes when running make check in
> >> contrib/test_decoding.
> >> Looking closer, wal_level = logical is broken:
> >> =# show wal_level;
> >> wal_level
> >> -----------
> >> logical
> >> (1 row)
> >> =# create database foo;
> >> The connection to the server was lost. Attempting reset: Failed.
> >
> >
> > Fixed.
> >
> >
> >> Looking even closer, log_heap_new_cid is broken:
> >
> >
> > Fixed.
> >
> >
> >> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
> >> the static XLogRecData entries if buffer_id >= 0 or to
> >> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to
> the
> >> WAL record). XLogRegisterXLogRecData does something similar, but with an
> >> entry of XLogRecData already built. What do you think about removing
> >> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes
> the
> >> interface simpler, and XLogRecData could be kept private in xlog.c. This
> >> would need more work especially on gin side where I am seeing for
> example
> >> constructLeafRecompressWALData directly building a XLogRecData entry.
> >
> >
> > Hmm. I considered that, but punted because I didn't want to rewrite all
> the
> > places that currently use XLogRecDatas in less straightforward ways. I
> kept
> > XLogRegisterXLogRecData as an escape hatch for those.
> >
> > But I bit the bullet now, and got rid of all the
> XLogRegisterXLogRecData()
> > calls. XLogRecData struct is now completely private to xlog.c.
> >
> > Another big change in the attached patch version is that
> XLogRegisterData()
> > now *copies* the data to a working area, instead of merely adding a
> > XLogRecData entry to the chain. This simplifies things in xlog.c, now
> that
> > the XLogRecData concept is not visible to the rest of the system. This
> is a
> > subtle semantic change for the callers: you can no longer register a
> struct
> > first, and fill the fields afterwards.
> >
> > That adds a lot of memcpys to the critical path, which could be bad for
> > performance. In practice, I think it's OK, or even a win, because a
> typical
> > WAL record payload is very small. But I haven't measured that. If it
> becomes
> > an issue, we could try to optimize, and link larger data blocks to the
> rdata
> > chain, but memcpy small small chunks of data. There are a few WAL record
> > types that don't fit in a small statically allocated buffer, so I
> provided a
> > new function, XLogEnsureSpace(), that can be called before XLogBegin() to
> > allocate a large-enough working area.
> >
> > These changes required changing a couple of places where WAL records are
> > created:
> >
> > * ginPlaceToPage. This function has a strange split of responsibility
> > between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
> ginPlaceToPage
> > calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a
> tree
> > it is, and dataPlaceToPage or entryPlaceToPage contributes some data to
> the
> > WAL record. Then ginPlaceToPage adds some more, and finally calls
> > XLogInsert(). I simplified the SPLIT case so that we now just create full
> > page images of both page halves. We were already logging all the data on
> > both page halves, but we were doing it in a "smarter" way, knowing what
> kind
> > of pages they were. For example, for an entry tree page, we included an
> > IndexTuple struct for every tuple on the page, but didn't include the
> item
> > pointers. A straight full page image takes more space, but not much, so I
> > think in any real application it's going to be lost in noise. (again,
> > haven't measured it though)
> >
> >
> >> 8) What do you think about adding in the README a description about how
> to
> >> retrieve the block list modified by a WAL record, for a flow similar to
> >> that:
> >> record = XLogReadRecord(...);
> >> nblocks = XLogGetBlockRefIds(record, blockids);
> >> for (i = 0; i < nblocks; i++)
> >> {
> >> bkpb = XLogGetBlockRef(record, id, NULL);
> >> // stuff
> >> }
> >> pg_xlogdump is a good example of what can be done as well.
> >
> >
> > Yeah, that's probably in order. I don't actually like the way that works
> > currently very much. Passing the char[256] blockids array to the function
> > feels weird. Will have to think a bit harder about that..
> >
> >
> >> 9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
> >> + char *payload = XLogRecGetData(record) +
> >> sizeof(ginxlogInsert);
> >> +#ifdef NOT_USED
> >> leftChildBlkno = BlockIdGetBlockNumber((BlockId)
> >> payload);
> >> +#endif
> >
> >
> > No. XLogRecGetData() is used to get data from the "main chunk", while
> > XLogGetPayload() is used to get data associated with a buffer. This does
> the
> > former.
> >
> >
> >> 10) Shouldn't the call to XLogBeginInsert come first here?
> >> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
> >> *state,
> >> {
> >> XLogRecPtr recptr;
> >>
> >> - recptr = XLogInsert(RM_SPGIST_ID,
> >> XLOG_SPGIST_ADD_NODE, rdata);
> >> + XLogRegisterBuffer(0, saveCurrent.buffer,
> >> REGBUF_STANDARD); /* orig page */
> >> + XLogRegisterBuffer(1, current->buffer,
> >> REGBUF_STANDARD); /* new page */
> >> + if (xlrec.parentBlk == 2)
> >> + XLogRegisterBuffer(2, parent->buffer,
> >> REGBUF_STANDARD);
> >> +
> >> + XLogBeginInsert();
> >> + XLogRegisterData(-1, (char *) &xlrec,
> >> sizeof(xlrec));
> >> + XLogRegisterData(-1, (char *) newInnerTuple,
> >> newInnerTuple->size);
> >> +
> >> + recptr = XLogInsert(RM_SPGIST_ID,
> >> XLOG_SPGIST_ADD_NODE);
> >
> >
> > Yeah. Apparently we have no regression test coverage of this codepath,
> or it
> > would've triggered an assertion. Fixed now, just by looking at the code,
> but
> > there's still no test coverage of this so it's likely still broken in
> other
> > ways. Before this is committed, I think we'll need to go through the
> > coverage report and make sure all the XLogInsert() codepaths (and their
> redo
> > routines) are covered.
> >
> >
> >> 11) Any reason for raising a PANIC here?
> >> @@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
> >> break;
> >> }
> >> if (i < 0)
> >> - elog(ERROR, "lock %s %d is not held", T_NAME(l),
> T_ID(l));
> >> + elog(PANIC, "lock %s %d is not held", T_NAME(l),
> T_ID(l));
> >> num_held_lwlocks--;
> >> for (; i < num_held_lwlocks; i++)
> >
> >
> > No, leftover from debugging. Fixed.
> >
> >
> >> 14) SPgist redo is crashing (during make installcheck run on a master,
> >> redo
> >> done on a standby at recovery).
> >> (lldb) bt
> >> * thread #1: tid = 0x0000, 0x00007fff8a3c2866
> >> libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
> >> * frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill
> +
> >> 10
> >> frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill +
> >> 92
> >> frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
> >> frame #3: 0x000000010ba8d819
> >> postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
> >> errorType=0x000000010baf019f, fileName=0x000000010bb212dd,
> lineNumber=65)
> >> +
> >> 137 at assert.c:54
> >> frame #4: 0x000000010b5c3a50
> >> postgres`addOrReplaceTuple(page=0x000000010bfee980,
> >> tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
> >> frame #5: 0x000000010b5c031f
> postgres`spgRedoMoveLeafs(lsn=71708608,
> >> record=0x00007ffd41822800) + 719 at spgxlog.c:263
> >> frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
> >> record=0x00007ffd41822800) + 402 at spgxlog.c:988
> >> frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at
> >> xlog.c:6780
> >> frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
> >> startup.c:224
> >> frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
> >> argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
> >> frame #10: 0x000000010b864998
> >> postgres`StartChildProcess(type=StartupProcess) + 328 at
> postmaster.c:5090
> >> frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
> >> argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
> >> frame #12: 0x000000010b7a9335 postgres`main(argc=3,
> >> argv=0x00007ffd40c04470) + 773 at main.c:227
> >
> >
> > Fixed.
> >
> >
> >> Also, I wanted to run the WAL replay facility to compare before and
> after
> >> buffer images, but code crashed before... I am still planning to do so
> >> once
> >> this code gets more stable though.
> >
> >
> > I did that earlier, and did it now again. I got a single difference from
> a
> > sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have
> to
> > investigate it later.
>
> With the patch, when I ran the simple test which generates lots of
> WAL, I got PANIC error.
>
> =# create extension pgcrypto ;
> CREATE EXTENSION
> =# create extension pg_trgm ;
> CREATE EXTENSION
> =# create table hoge (col1 text);
> CREATE TABLE
> =# create index hogeidx on hoge using gin (col1 gin_trgm_ops);
> CREATE INDEX
> =# insert into hoge select gen_random_uuid() from
> generate_series(1,10000000);
> PANIC: too many registered buffers
> PANIC: too many registered buffers
> The connection to the server was lost. Attempting reset: Failed.
>
Yep, this is easily reproducible. I imagine that a call to
XLogEnsureRecordSpace before START_CRIT_SECTION() is missing in the gin
code path.
--
Michael

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Abhijit Menon-Sen 2014-07-02 06:08:31 Re: 9.5 CF1
Previous Message Tom Lane 2014-07-02 05:55:14 Re: Array of composite types returned from python