Re: PANIC in GIN code

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Heikki <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PANIC in GIN code
Date: 2015-06-29 16:20:11
Message-ID: CAMkU=1zUZPhY+Dt6dy3YNqX8384RFk2Rj71bUm2_Nbz9wCG56w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 29, 2015 at 1:37 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 06/29/2015 01:12 AM, Jeff Janes wrote:
>
>> Now I'm getting a different error, with or without checksums.
>>
>> ERROR: invalid page in block 0 of relation base/16384/16420
>> CONTEXT: automatic vacuum of table "jjanes.public.foo"
>>
>> 16420 is the gin index. I can't even get the page with pageinspect:
>>
>> jjanes=# SELECT * FROM get_raw_page('foo_text_array_idx', 0);
>> ERROR: invalid page in block 0 of relation base/16384/16420
>>
>> These are the last few GIN entries from pg_xlogdump:
>>
>> rmgr: Gin len (rec/tot): 0/ 3893, tx: 0, lsn: 0/77270E90, prev 0/77270E68, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 27 FPW
>> rmgr: Gin len (rec/tot): 0/ 3013, tx: 0, lsn: 0/77272080, prev 0/77272058, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 6904 FPW
>> rmgr: Gin len (rec/tot): 0/ 3093, tx: 0, lsn: 0/77272E08, prev 0/77272DE0, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 1257 FPW
>> rmgr: Gin len (rec/tot): 8/ 4662, tx: 318119897, lsn: 0/77A2CF10, prev 0/77A2CEC8, desc: INSERT_LISTPAGE , blkref #0: rel 1663/16384/16420 blk 22184
>> rmgr: Gin len (rec/tot): 88/ 134, tx: 318119897, lsn: 0/77A2E188, prev 0/77A2E160, desc: UPDATE_META_PAGE , blkref #0: rel 1663/16384/16420 blk 0
>>
>
Another piece of info here that might be relevant. Almost all
UPDATE_META_PAGE xlog records other than the last one have two backup
blocks. The last UPDATE_META_PAGE record has only one backup block.
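The backup-block counts can be checked mechanically. The Python sketch below is a hypothetical helper (nothing like it ships with PostgreSQL) that parses pg_xlogdump record lines in the one-record-per-line form the tool prints, lists each record's blkref entries with their FPW flag, and estimates the record's end LSN as lsn + MAXALIGN(tot_len). The regexes are fitted to the excerpt in this thread, and the end-LSN formula assumes an 8-byte MAXALIGN and a record that does not cross a WAL page boundary.

```python
import re

# Patterns fitted to the pg_xlogdump (9.5) lines in this thread; real output
# pads fields with varying whitespace, hence the \s* / \s+ separators.
RECORD_RE = re.compile(
    r"rmgr:\s*(?P<rmgr>\w+)\s+len \(rec/tot\):\s*(?P<rec>\d+)/\s*(?P<tot>\d+),"
    r"\s*tx:\s*(?P<tx>\d+),\s*lsn:\s*(?P<lsn>[0-9A-F]+/[0-9A-F]+),"
    r"\s*prev\s+(?P<prev>[0-9A-F]+/[0-9A-F]+),\s*desc:\s*(?P<desc>\w+)"
)
BLKREF_RE = re.compile(r"blkref #(\d+): rel (\S+) blk (\d+)( FPW)?")
MAXIMUM_ALIGNOF = 8  # assumption: typical 64-bit build


def parse_lsn(text):
    """'0/77A2E188' -> 64-bit integer (high word / low word, both hex)."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value):
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def maxalign(n):
    return (n + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)


def parse_record(line):
    """Return a dict describing one pg_xlogdump record line, or None."""
    m = RECORD_RE.search(line)
    if m is None:
        return None
    lsn = parse_lsn(m["lsn"])
    tot_len = int(m["tot"])
    return {
        "rmgr": m["rmgr"],
        "tx": int(m["tx"]),
        "desc": m["desc"],
        "lsn": lsn,
        "tot_len": tot_len,
        # Simplification: only correct if the record does not cross a WAL page.
        "end_lsn": lsn + maxalign(tot_len),
        # (block ref number, relfilenode path, block number, full-page image?)
        "blkrefs": [(int(n), rel, int(blk), bool(fpw))
                    for n, rel, blk, fpw in BLKREF_RE.findall(line)],
    }


last = parse_record(
    "rmgr: Gin len (rec/tot): 88/ 134, tx: 318119897, lsn: 0/77A2E188, "
    "prev 0/77A2E160, desc: UPDATE_META_PAGE , blkref #0: rel 1663/16384/16420 blk 0"
)
print(last["desc"], len(last["blkrefs"]), format_lsn(last["end_lsn"]))
# UPDATE_META_PAGE 1 0/77A2E210
```

For the final UPDATE_META_PAGE record this gives one block reference and an end LSN of 0/77A2E210. That is also the value in the first eight bytes of the od dump of block 0 below, reading the words 161020 and 073642 as little-endian halves (0xE210 and 0x77A2).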

And the metapage is mostly zeros:
>>
>> head -c 8192 /tmp/data2_invalid_page/base/16384/16420 | od
>> 0000000 000000 000000 161020 073642 000000 000000 000000 000000
>> 0000020 000000 000000 000000 000000 053250 000000 053250 000000
>> 0000040 006140 000000 000001 000000 000001 000000 000000 000000
>> 0000060 031215 000000 000452 000000 000000 000000 000000 000000
>> 0000100 025370 000000 000000 000000 000002 000000 000000 000000
>> 0000120 000000 000000 000000 000000 000000 000000 000000 000000
>> *
>> 0020000
>>
>
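To see which bytes of block 0 survive, the Python sketch below rebuilds the page from the od output and decodes it as a page header followed by GIN metapage data. The helper names are hypothetical, and the layouts are my reading of the 9.5 headers: a 24-byte PageHeaderData (pd_lsn, pd_checksum, pd_flags, pd_lower, pd_upper, pd_special, pd_pagesize_version, pd_prune_xid) followed by GinMetaPageData at MAXALIGN(24) = 24, assuming a little-endian x86-64 build with 8-byte alignment.

```python
import struct

# The od(1) output from the message, verbatim. Default od prints an octal
# offset followed by eight 2-byte words in octal, read in host byte order
# (assumed little-endian, i.e. x86). "*" means "same as the previous line".
OD_DUMP = """\
0000000 000000 000000 161020 073642 000000 000000 000000 000000
0000020 000000 000000 000000 000000 053250 000000 053250 000000
0000040 006140 000000 000001 000000 000001 000000 000000 000000
0000060 031215 000000 000452 000000 000000 000000 000000 000000
0000100 025370 000000 000000 000000 000002 000000 000000 000000
0000120 000000 000000 000000 000000 000000 000000 000000 000000
*
0020000
"""


def od_to_bytes(dump):
    """Rebuild the raw bytes that a default `od` dump describes."""
    data = bytearray()
    last_chunk = b""
    repeat = False
    for line in dump.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields == ["*"]:
            repeat = True
            continue
        offset = int(fields[0], 8)
        if repeat:
            # Re-emit the last printed line until we reach the next offset.
            while last_chunk and len(data) < offset:
                data += last_chunk
            repeat = False
        chunk = b"".join(struct.pack("<H", int(w, 8)) for w in fields[1:])
        data += chunk
        last_chunk = chunk or last_chunk
    return bytes(data)


def decode_metapage(page):
    # PageHeaderData: pd_lsn as two uint32 halves, then six uint16 fields
    # and the uint32 pd_prune_xid (24 bytes in total).
    header = dict(zip(
        ("lsn_hi", "lsn_lo", "pd_checksum", "pd_flags", "pd_lower",
         "pd_upper", "pd_special", "pd_pagesize_version", "pd_prune_xid"),
        struct.unpack_from("<IIHHHHHHI", page, 0)))
    # GinMetaPageData; "4x" is the padding that 8-byte-aligns nEntries.
    meta = dict(zip(
        ("head", "tail", "tailFreeSize", "nPendingPages", "nPendingHeapTuples",
         "nTotalPages", "nEntryPages", "nDataPages", "nEntries", "ginVersion"),
        struct.unpack_from("<IIIIqIII4xqi", page, 24)))
    return header, meta


page = od_to_bytes(OD_DUMP)
header, meta = decode_metapage(page)
print(len(page), f"{header['lsn_hi']:X}/{header['lsn_lo']:X}")
print({k: v for k, v in header.items() if not k.startswith("lsn")})
print(meta)
```

Under those assumptions every header field except pd_lsn (0/77A2E210) is zero, including pd_lower, pd_upper and pd_special, while the metadata decodes to plausible values: head and tail both point at block 22184 (the block written by the INSERT_LISTPAGE record in the pg_xlogdump excerpt), nPendingPages is 1 and ginVersion is 2.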
> Hmm. Looking at ginRedoUpdateMetapage, I think I see the problem: it
> doesn't initialize the page. It copies the metapage data, but it doesn't
> touch the page headers. The only way I can see that that would cause
> trouble is if the index somehow got truncated away or removed in the
> standby. That could happen in crash recovery, if you drop the index and then
> crash, but that should be harmless, because crash recovery doesn't try to
> read the metapage, only update it (by overwriting it), and by the time
> crash recovery has completed, the index drop is replayed too.
>
> But AFAICS that bug is present in earlier versions too.
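An uninitialized header would also explain why the error reads "invalid page" with or without checksums. As I read bufpage.c in 9.5, PageIsVerified() treats a page with pd_upper == 0 as "new" and then accepts it only if every byte is zero, so a metapage whose data was copied in but whose header was left zeroed (as the od output shows) fails both tests. The Python sketch below is a simplified model of that check that skips the checksum comparison; page_looks_valid and make_metapage are hypothetical names, and the constants are assumptions for a default 64-bit build.

```python
import struct

BLCKSZ = 8192
MAXIMUM_ALIGNOF = 8            # assumption: typical 64-bit build
PD_VALID_FLAG_BITS = 0x0007    # PD_HAS_FREE_LINES | PD_PAGE_FULL | PD_ALL_VISIBLE
HEADER_FMT = "<IIHHHHHHI"      # PageHeaderData, little-endian, 24 bytes


def maxalign(n):
    return (n + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)


def page_looks_valid(page):
    """Simplified model of PageIsVerified(), ignoring the checksum test."""
    (_, _, _, flags, lower, upper, special, _, _) = struct.unpack_from(HEADER_FMT, page, 0)
    if upper != 0:  # PageIsNew() means pd_upper == 0
        if ((flags & ~PD_VALID_FLAG_BITS) == 0
                and lower <= upper <= special <= BLCKSZ
                and special == maxalign(special)):
            return True
    # Otherwise the page is only accepted if every byte is zero.
    return not any(page)


def make_metapage(lower, upper, special, lsn=0x77A2E210):
    """Build a hypothetical metapage: the given header fields plus GIN metadata."""
    page = bytearray(BLCKSZ)
    struct.pack_into(HEADER_FMT, page, 0, lsn >> 32, lsn & 0xFFFFFFFF,
                     0, 0, lower, upper, special, 0, 0)
    # Metadata as redo would copy it in (head = tail = 22184 in this thread).
    struct.pack_into("<II", page, 24, 22184, 22184)
    return bytes(page)


# Header as the od dump shows it: only pd_lsn is set.
broken = make_metapage(lower=0, upper=0, special=0)
# Header as PageInit() would lay out an 8 kB page with 8 bytes of special
# space (assumed sizeof(GinPageOpaqueData)); illustrative values only.
initialized = make_metapage(lower=24, upper=8184, special=8184)

print(page_looks_valid(broken), page_looks_valid(initialized))
# False True
```

The model only shows why the zeroed header is rejected; it says nothing about how the redo routine should be changed.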

Yes, I did see this error reported previously but it was always after the
first appearance of the PANIC, so I assumed it was a sequela of that and
didn't investigate it further at that time.

> Can you reproduce this easily? How?

I can reproduce it fairly easily.

I apply the attached patch and compile with --enable-cassert (the full list
of configure options is '--enable-debug' '--with-libxml' '--with-perl'
'--with-python' '--with-ldap' '--with-openssl' '--with-gssapi'
'--prefix=/home/jjanes/pgsql/torn_bisect/' '--enable-cassert').

Then edit do.sh to point to the data directory and installation directory
you want, and run that. It calls count.pl from the same directory. I
started getting the errors after about 10 minutes on an 8-core Intel(R)
Xeon(R) CPU E5-2650 0 @ 2.00GHz.

sh do.sh >& do_cassert_fix.out2 &

The output is quite a mess, mingling the PostgreSQL log output with the Perl
output. Since I already know what I'm looking for, I use:

tail -f do_cassert_fix.out2 | fgrep ERROR
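Because the log mixes PostgreSQL and Perl output, a filter that keeps each ERROR line together with the CONTEXT line that follows it (such as the "automatic vacuum" context quoted earlier) can be handier than a bare fgrep. The Python sketch below is a hypothetical helper, not part of the attached scripts, and it assumes a CONTEXT line belongs to the ERROR line directly before it, which interleaved output does not guarantee.

```python
def errors_with_context(lines):
    """Yield (error_line, context_line or None) for each line containing ERROR.

    Assumes a CONTEXT line belongs to the ERROR line immediately before it.
    """
    pending = None
    for raw in lines:
        line = raw.rstrip("\n")
        if pending is not None:
            context = line if line.lstrip().startswith("CONTEXT:") else None
            yield pending, context
            pending = None
            if context is not None:
                continue
        if "ERROR" in line:
            pending = line
    if pending is not None:
        yield pending, None


# Sample input: the two log lines from this thread plus a made-up Perl line.
sample = [
    "ERROR: invalid page in block 0 of relation base/16384/16420\n",
    'CONTEXT: automatic vacuum of table "jjanes.public.foo"\n',
    "some unrelated Perl output\n",
]
for error, context in errors_with_context(sample):
    print(error, "|", context)
```

For a live run, the same generator could be fed from sys.stdin with tail -f do_cassert_fix.out2 piped into the script.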

Cheers,

Jeff

Attachment Content-Type Size
do.sh text/x-sh 4.3 KB
count.pl text/x-perl 8.3 KB
crash_REL9_5.patch text/x-diff 11.3 KB
