Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Date: 2012-02-12 23:04:59
Message-ID: CAMkU=1zGWp2QnTjiyFe0VMu4gc+MoEexXYaVC2u=+ORfiYj6ow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 9, 2012 at 3:02 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>>
>>> After applying this patch and then forcing crashes, upon recovery the
>>> database is not correct.
>>>
>>> If I make a table with 10,000 rows and then after that intensively
>>> update it using a unique key:
>>>
>>> update foo set count=count+1 where foobar=?
>>>
>>> Then after the crash there are less than 10,000 visible rows:
>>>
>>> select count(*) from foo
>>>
>>> This not a subtle thing, it happens every time.  I get counts of
>>> between 1973 and 8827.  Without this patch I always get exactly
>>> 10,000.
>>>
>>> I don't really know where to start on tracking this down.
>>
>> Similar problem happened on my test. When I executed CREATE TABLE and
>> shut down the server with immediate mode, after recovery I could not see the
>> created table. Here are the server log of recovery with wal_debug = on:
>>
>> LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 JST
>> LOG:  database system was not properly shut down; automatic recovery in progress
>> LOG:  redo starts at 0/179CC90
>> LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
>> XLOG - nextOid: 24576
>> LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
>> Storage - file create: base/12277/16384
>> LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
>> 21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
>> LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
>> LOG:  redo done at 0/179CCE8
>>
>> According to the log "there is no contrecord flag", ISTM the path treats the
>> contrecord of backup block incorrectly, and which causes the problem.
>
> Yep, as far as I read the patch, it seems to have forgotten to set
> XLP_FIRST_IS_CONTRECORD flag.

Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
I have no idea if I did it correctly, in particular if calling
GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
effects that make that a bad thing to do. I'm not proposing it as the
real fix, I just wanted to get around this problem in order to do more
testing.

It does get rid of the "there is no contrecord flag" errors, but
recover still does not work.

Now the count of tuples in the table is always correct (I never
provoke a crash during the initial table load), but sometimes updates
to those tuples that were reported to have been committed are lost.

This is more subtle, it does not happen on every crash.

It seems that when recovery ends on "record with zero length at...",
that recovery is correct.

But when it ends on "invalid magic number 0000 in log file.." then the
recovery is screwed up.

Cheers,

Jeff

Attachment Content-Type Size
xloginsert_fix.patch application/octet-stream 1.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Smith 2012-02-13 02:04:28 Re: random_page_cost vs seq_page_cost
Previous Message Joshua Berkus 2012-02-12 20:33:51 3rd Cluster Hackers Summit, May 15th in Ottawa