Re: Bug in visibility map WAL-logging

From: Matheus de Oliveira <matioli(dot)matheus(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: Bug in visibility map WAL-logging
Date: 2014-01-08 00:42:59
Message-ID: CAJghg4+8gneMTtSw7R==5b9T719mqfsCAzar2qcHpYo5odcD2A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 7, 2014 at 5:36 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:

> On 01/07/2014 07:15 PM, Matheus de Oliveira wrote:
>
>> Hi folks,
>>
>> On Fri, Dec 13, 2013 at 9:47 AM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>
>>> lazy_vacuum_page() does this:
>>>
>>> 1. START_CRIT_SECTION()
>>> 2. Remove dead tuples from the page, marking the itemid's unused.
>>> 3. MarkBufferDirty
>>> 4. if all remaining tuples on the page are visible to everyone, set the
>>> all-visible flag on the page, and call visibilitymap_set() to set the VM
>>> bit.
>>> 5. visibilitymap_set() writes a WAL record about setting the bit in the
>>> visibility map.
>>> 6. write the WAL record of removing the dead tuples.
>>> 7. END_CRIT_SECTION().
>>>
>>> See the problem? Setting the VM bit is WAL-logged first, before the
>>> removal of the tuples. If you stop recovery between the two WAL records,
>>> the page's VM bit in the VM map will be set, but the dead tuples are
>>> still
>>> on the page.
>>>
>>> This bug was introduced in Feb 2013, by commit
>>> fdf9e21196a6f58c6021c967dc5776a16190f295, so it's present in 9.3 and
>>> master.
>>>
>>> The fix seems quite straightforward, we just have to change the order of
>>> those two records. I'll go commit that.
>>
>> With a lot of help from Andres on IRC (thanks a lot man), I think (he
>> thinks, actually, =P ) I may have run into the issue this bug can cause.
>> I'm sending this e-mail to (1) confirm whether my issue is really caused
>> by that bug and, if that is the case, (2) let you guys know the problems
>> it can cause.
>>
>> Scenario:
>>
>> Master: 9.3.1 (I know, trying to go to 9.3.2)
>> Slave: 9.3.2
>>
>> The slave was running with hot_standby=off (because of another bug Andres
>> is working on), but it stopped working with the following log lines:
>>
>> 2014-01-07 12:38:04 BRST [22668]: [7-1] user=,db= WARNING: page 1076 of
>> relation base/883916/151040222 is uninitialized
>> 2014-01-07 12:38:04 BRST [22668]: [8-1] user=,db= CONTEXT: xlog redo
>> visible: rel 1663/883916/151040222; blk 1076
>> 2014-01-07 12:38:04 BRST [22668]: [9-1] user=,db= PANIC: WAL contains
>> references to invalid pages
>> 2014-01-07 12:38:04 BRST [22668]: [10-1] user=,db= CONTEXT: xlog redo
>> visible: rel 1663/883916/151040222; blk 1076
>> 2014-01-07 12:38:04 BRST [22665]: [3-1] user=,db= LOG: startup process
>> (PID 22668) was terminated by signal 6: Aborted
>>
>> (Complete log at https://pgsql.privatepaste.com/343f3190fe).
>>
>> I used pg_xlogdump to (try to) find the block the error relates to:
>>
>> $ pg_xlogdump pg_xlog/00000001000002BC0000002B 00000001000002BC0000007F |
>> grep -n -C5 '883916/151040222; blk 1076'
>>
>> 2088220 ... Heap2 ... 24/ 192, ... lsn: 2BC/46AB8B20 ... desc: clean:
>> rel 1663/883916/151040222; blk 1073 remxid 107409880
>> 2088221 ... Heap2 ... 20/ 52, ... lsn: 2BC/46AB8BE0 ... desc: visible:
>> rel 1663/883916/151040222; blk 1074
>> 2088222 ... Heap2 ... 24/ 128, ... lsn: 2BC/46AB8C18 ... desc: clean:
>> rel 1663/883916/151040222; blk 1074 remxid 107409880
>>
> > ...
>
> Hmm. The xlogdump indeed shows that the order of 'clean' and 'visible' is
> incorrect, but I don't immediately see how that could cause the PANIC.

Well... I also didn't understand how it could cause the PANIC. If I
understood correctly, it could only leave a visibility map bit with the
wrong value (e.g. the bit gets set but the dead tuples are not removed, if
there is a failure between the two records - which does not seem to have
happened here). Is that right?
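
For reference, a narrower pg_xlogdump run - same segments and relfilenode as
above, just restricted to the Heap2 resource manager with --rmgr - should
make the 'visible'-before-'clean' ordering easy to spot:

    $ pg_xlogdump --rmgr=Heap2 pg_xlog/00000001000002BC0000002B 00000001000002BC0000007F \
        | grep '1663/883916/151040222; blk 107[3-6]'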

> Why is the page uninitialized in the standby? If VACUUM is removing some
> dead tuples from it, it certainly should exist and be correctly initialized.
>
>
That I don't know for sure...

> How did you set up the standby? Did you initialize it from an offline
> backup of the master's data directory, perhaps? The log shows that the
> startup took the "crash recovery first, then start archive recovery"
> path, because there was no backup label file. In that mode, the standby
> assumes that the system is consistent after replaying all the WAL in
> pg_xlog, which is correct if you initialize from an offline backup or an
> atomic filesystem snapshot, for example. But "WAL contains references to
> invalid pages" could also be a symptom of an inconsistent base backup,
> caused by an incorrect backup procedure. In particular, I have to ask
> because I've seen it before: you didn't delete backup_label from the
> backup, did you?
>

Well, I cannot answer this right now, but that makes sense and is entirely
possible.

I used a script created by someone else that does pg_start_backup + rsync +
pg_stop_backup, but I haven't actually looked at the script to see whether
it does something nasty, such as removing backup_label. I'll be able to
check it tomorrow and will come back with a definitive answer.
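
For comparison, a minimal sketch of the procedure such a script would be
expected to follow (the hostname and paths below are just placeholders, not
the actual script); the backup_label written by pg_start_backup() has to
stay in the copied data directory:

    # On the master: begin the base backup; this writes $PGDATA/backup_label
    psql -c "SELECT pg_start_backup('standby_base', true);"

    # Copy the data directory to the standby (pg_xlog excluded);
    # backup_label must be part of the copy and must not be deleted afterwards
    rsync -a --exclude=pg_xlog /var/lib/pgsql/9.3/data/ standby:/var/lib/pgsql/9.3/data/

    # On the master: finish the backup
    psql -c "SELECT pg_stop_backup();"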

@andres, if it is really removing backup_label, it could also cause that
other issue we saw on Monday, right? (Yes, I did run the same script.)

BTW, is there any situation in which backup_label should be removed? I see
no reason for doing so, and I had never seen anyone do it, so I didn't even
consider that it could be the cause.
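
In case it helps whoever ends up checking the script, a quick way to see
whether the restored data directory still has the label (the path below is
only illustrative):

    # If backup_label is present, recovery starts from the checkpoint recorded
    # in it rather than from the one in pg_control
    ls -l /var/lib/pgsql/9.3/data/backup_label && cat /var/lib/pgsql/9.3/data/backup_label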

Thank you guys for your attention.

Best regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres
