Re: crash-safe visibility map, take four

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: 高增琦 <pgf00a(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jesper Krogh <jesper(at)krogh(dot)cc>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: crash-safe visibility map, take four
Date: 2011-03-31 10:31:24
Message-ID: 4D9457FC.6010700@enterprisedb.com
Lists: pgsql-hackers

On 31.03.2011 11:33, 高增琦 wrote:
> Consider an example:
> 1. delete on two pages, emitting two log records: (1, page1, vm_clear_1) and
> (2, page2, vm_clear_2)
> 2. "vm_clear_1" and "vm_clear_2" are on the same vm page
> 3. checkpoint, and the vm page gets torn, so vm_clear_2 is lost
> 4. delete on another page, emitting one log record (3, page1, vm_clear_3);
> vm_clear_3 is still on that vm page
> 5. power down
> 6. startup; redo replays all changes after the checkpoint, but the bit for
> vm_clear_2 will never be cleared
> Am I right?

No. A page can only be torn at a hard crash, i.e. at step 5. A checkpoint
flushes all changes to disk; once the checkpoint finishes, all the
changes before it are safe on disk.

If you crashed between step 2 and 3, the VM page might be torn so that
only one of the vm_clears has made it to disk but the other has not. But
the WAL records for both are on disk anyway, so that will be corrected
at replay.
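
To illustrate why replay repairs a torn VM page, here is a toy sketch in C.
It is not PostgreSQL code: the VmClearRecord struct, the replay_vm_clears
function and the 8-bit "page" are all hypothetical, and the model ignores
full-page images and page LSNs. It only shows that clearing a bit is
idempotent, so replaying every record after the redo point fixes the page
regardless of which half of it reached disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VM_BITS 8   /* toy VM page covering 8 heap pages (assumption) */

/* Hypothetical WAL record: "clear the all-visible bit for heap_page". */
typedef struct {
    uint64_t lsn;
    int      heap_page;
} VmClearRecord;

/*
 * Replay every record at or after the checkpoint's redo pointer.
 * Clearing a bit that is already clear is a no-op, so it does not
 * matter whether the torn page already contained a given change.
 */
static void replay_vm_clears(bool vm[VM_BITS], const VmClearRecord *wal,
                             int nrecords, uint64_t redo_lsn)
{
    for (int i = 0; i < nrecords; i++) {
        assert(wal[i].heap_page >= 0 && wal[i].heap_page < VM_BITS);
        if (wal[i].lsn >= redo_lsn)
            vm[wal[i].heap_page] = false;
    }
}
```

With a page where vm_clear_1 reached disk and vm_clear_2 did not, replaying
both records leaves both bits cleared and the other bits untouched.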

>> Another question:
>>> To address the problem in
>>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
>>> , should we just clear the vm before the log of insert/update/delete?
>>> This may reduce the performance, is there another solution?
>>>
>>
>> Yeah, that's a straightforward way to fix it. I don't think the performance
>> hit will be too bad. But we need to be careful not to hold locks while doing
>> I/O, which might require some rearrangement of the code. We might want to do
>> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
>> lock and update the heap page, and then set the VM bit while holding the
>> lock on the heap page.
>>
> Do you mean we should lock the heap page first, then get the block number,
> then release the heap page,
> then pin the vm page, then lock both the heap page and the vm page?
> As Robert Haas said, when we lock the heap page again, there may not be
> enough free space on it.

I think the sequence would have to be:

1. Pin the heap page.
2. Check if the all-visible flag is set on the heap page (without lock).
   If it is, pin the vm page.
3. Lock the heap page, check that it has enough free space.
4. Check again if the all-visible flag is set. If it is but we didn't
   pin the vm page yet, release the lock and loop back to step 2.
5. Update the heap page.
6. Update the vm page.
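
The steps above can be sketched as a single-threaded toy model in C. This
is only an illustration of the control flow, not the real buffer manager:
HeapPage, VmPage and heap_modify_sketch are hypothetical names, pins and
locks are plain counters and flags, and "update vm page" is assumed here to
mean clearing the all-visible bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a heap buffer and a VM buffer. */
typedef struct {
    bool all_visible;   /* page-level all-visible flag */
    int  free_space;
    bool locked;
    int  pin_count;
} HeapPage;

typedef struct {
    bool bit;           /* VM bit for the heap page */
    int  pin_count;
} VmPage;

/* Returns true if the heap page was modified, false if it lacked space. */
static bool heap_modify_sketch(HeapPage *heap, VmPage *vm, int needed)
{
    bool vm_pinned = false;
    bool ok = false;

    heap->pin_count++;                          /* step 1 */

    for (;;) {
        /* step 2: unlocked peek; pinning may do I/O, so no lock is held */
        if (heap->all_visible && !vm_pinned) {
            vm->pin_count++;
            vm_pinned = true;
        }

        heap->locked = true;                    /* step 3 */
        if (heap->free_space < needed)
            break;

        /* step 4: the flag may have been set since the unlocked peek */
        if (heap->all_visible && !vm_pinned) {
            heap->locked = false;
            continue;                           /* back to step 2 */
        }

        heap->free_space -= needed;             /* step 5 */
        if (heap->all_visible) {                /* step 6 */
            assert(vm_pinned);
            heap->all_visible = false;
            vm->bit = false;
        }
        ok = true;
        break;
    }

    heap->locked = false;
    if (vm_pinned)
        vm->pin_count--;
    heap->pin_count--;
    return ok;
}
```

In this single-threaded model the retry in step 4 is never taken; it exists
for the case where another backend sets the flag between the unlocked check
and acquiring the lock.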

> Is there a way just stop the checkpoint for a while?

Not at the moment. It wouldn't be hard to add, though. I was about to
add a mechanism for that last autumn to fix a similar issue with b-tree
parent pointer updates
(http://archives.postgresql.org/message-id/4CCFEE61.2090702@enterprisedb.com),
but in the end it was solved differently.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
