Re: Problem with PITR recovery

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Rob Butler <crodster2k(at)yahoo(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Jeff Davis <jdavis-pgsql(at)empires(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Problem with PITR recovery
Date: 2005-04-20 17:38:50
Message-ID: 1114018730.16721.2299.camel@localhost.localdomain
Lists: pgsql-hackers

On Mon, 2005-04-18 at 23:20 +0100, Simon Riggs wrote:
> My plan would be to write a special xlog record for xlog switching. This
> would be a special processing instruction, rather than a data/redo
> instruction. This would be implemented as another xlog info value on
> the xlog_redo resource manager function, XLOG_FILE_SWITCH. (xlog_redo
> would simply set a variable to be used elsewhere.)
>
> When written the xlog switch instruction (XLogInsert) would switch to a
> new xlog, just as if a file had been filled, causing it to be
> immediately archived.

This has been mostly implemented and posted to PATCHES, though I have a
later patch also. There are some points still to discuss.

Setting the pointer seems to work, but there are 3 pointers, each
protected by a separate lock. All of those are designed to be taken and
held independently.

My understanding is that the correct locking order would be:

WALInsertLock
WALWriteLock
info_lck

XLogInsert uses info_lck first, but then checks everything again once it
acquires WALInsertLock. To switch files, we must ensure that nobody can
insert xlog records with a record pointer higher than the log switch
record. This is different from checkpoints, where a checkpoint record can
actually occur before records which are logically after it; that must
never happen with a log switch, or else we'd miss them entirely on WAL
replay.
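
To illustrate why nothing may follow the switch record in the same file,
here is a toy sketch. It is not the real replay code: ToyRecord, RecKind
and toy_replay_segment are hypothetical names, and it assumes replay
simply stops reading a file once it sees the switch record.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record kinds; only the idea of a switch record is from the thread. */
typedef enum { REC_DATA, REC_FILE_SWITCH } RecKind;

typedef struct {
    RecKind kind;
    int     payload;
} ToyRecord;

/*
 * Replay one toy segment: count data records in order and stop at the
 * first switch record, treating the rest of the file as unused.
 * Returns the number of data records applied.
 */
size_t toy_replay_segment(const ToyRecord *recs, size_t nrecs)
{
    size_t applied = 0;

    assert(recs != NULL || nrecs == 0);

    for (size_t i = 0; i < nrecs; i++) {
        if (recs[i].kind == REC_FILE_SWITCH)
            break;              /* anything after this point is never seen */
        applied++;
    }
    return applied;
}
```

A data record placed after the switch record in the same segment is
silently skipped by this loop, which is the failure the locking has to
rule out.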

Next, from XLogInsert with WALInsertLock held, we wait to acquire
WALWriteLock, since an I/O might currently be in progress. Once we have
it, we issue an XLogWrite, during which we update the record pointer,
which is then propagated through to info_lck.

AFAICS this is the only case of unconditionally acquiring all 3 locks.
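
The sequence above can be sketched as a rank-checked acquisition. This is
only a single-threaded model of the proposed ordering, not the real
LWLock or spinlock code: ToyLockRank, ToyLockState, toy_acquire and
toy_log_switch are hypothetical names, and the ranks simply encode the
order WALInsertLock, WALWriteLock, info_lck.

```c
#include <assert.h>
#include <stdbool.h>

/* Ranks follow the proposed order; a lower rank must be taken first. */
typedef enum {
    RANK_NONE       = 0,
    RANK_WAL_INSERT = 1,   /* WALInsertLock */
    RANK_WAL_WRITE  = 2,   /* WALWriteLock */
    RANK_INFO_LCK   = 3    /* info_lck */
} ToyLockRank;

typedef struct {
    bool held[4];          /* indexed by rank; slot 0 is unused */
} ToyLockState;

/* Highest rank currently held, or RANK_NONE if nothing is held. */
ToyLockRank highest_held(const ToyLockState *s)
{
    for (int r = RANK_INFO_LCK; r >= RANK_WAL_INSERT; r--)
        if (s->held[r])
            return (ToyLockRank) r;
    return RANK_NONE;
}

/* Refuse any acquisition that would violate the rank order. */
bool toy_acquire(ToyLockState *s, ToyLockRank r)
{
    if (r <= highest_held(s))
        return false;      /* out of order, or already held */
    s->held[r] = true;
    return true;
}

void toy_release(ToyLockState *s, ToyLockRank r)
{
    s->held[r] = false;
}

/* Model of the log switch path: take all three locks in rank order. */
bool toy_log_switch(ToyLockState *s)
{
    if (!toy_acquire(s, RANK_WAL_INSERT))
        return false;
    if (!toy_acquire(s, RANK_WAL_WRITE)) {
        toy_release(s, RANK_WAL_INSERT);
        return false;
    }
    /* the XLogWrite step would happen here, advancing the write pointer */
    if (!toy_acquire(s, RANK_INFO_LCK)) {
        toy_release(s, RANK_WAL_WRITE);
        toy_release(s, RANK_WAL_INSERT);
        return false;
    }
    /* the updated pointer would be published under info_lck here */
    toy_release(s, RANK_INFO_LCK);
    toy_release(s, RANK_WAL_WRITE);
    toy_release(s, RANK_WAL_INSERT);
    return true;
}
```

If every code path that holds more than one of these locks respects the
same ranks, no cycle of waiters can form; a path that takes WALWriteLock
and then waits for WALInsertLock is the case the checker rejects.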

Do we agree that this is the correct lock sequence, and if it is, do we
think that this leaves open the chance of deadlock at any stage?

> A shutdown checkpoint would also have the same effect as an
> XLOG_FILE_SWITCH instruction, so that the archiver would be able to copy
> away the file. Otherwise, we'd have a problem as to which order to write
> the messages in at shutdown time. (Not happy about that bit, so
> suggestions welcome...)

Treating shutdown checkpoint markers as xlog switches is possible but
gives problems since archive_command is a SUSET variable. On replay we
wouldn't necessarily know whether a shutdown checkpoint was treated as
an xlog switch when it was written, so we'd need to attempt to switch
and look beyond the checkpoint marker, just in case. That makes me
uncomfortable.

Hmmm...

Best Regards, Simon Riggs
