Re: Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment

From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Postgresql-Hackers <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment
Date: 2007-08-30 17:31:23
Message-ID: 46D6FEEB.10309@phlo.org
Lists: pgsql-hackers

Gregory Stark wrote:
> "Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
>
>>>> It seems doable, but it's not pretty. One possible scheme would be to
>>>> emit a record *after* choosing a name but *before* creating the file,
>>> No, because the way you know the name is good is a successful
>>> open(O_CREAT).
>> The idea was to log *twice*. Once to say that we're about to create a file, and
>> the second time that we succeeded. That way, the filename shows up in the
>> log, even if we crash immediately after physically creating the file, which
>> gives recovery at least a chance to clean up the mess.
>
> It sounds like if the reason it fails is because someone else created the same
> file name you'll delete the wrong file?

Careful bookkeeping during recovery should be able to eliminate that risk,
I think. I've thought a bit more about this, and came up with the following
idea that also takes checkpoints into account.

We keep a global table of (xid, filename) pairs in shared memory. File creation
becomes
1) Generate a new filename
2) Add (CurrentTransactionId, filename) to the list, emit an XLOG record
saying we did this, and flush the log. If the filename is already on
the list, start over at (1).
3) Create the file. If this fails, delete the list entry and the file,
and start over at (1).
4) On (sub) transaction ABORT, we remove entries with the xids we abort,
and delete the files.
5) On top transaction COMMIT, we remove entries with the xids we commit,
and keep the files.
6) During top transaction PREPARE, we record the entries with matching xids
in the 2PC state file.
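
The steps above can be sketched as a toy model in Python. This is only an
illustration, not PostgreSQL code: the class PendingFileTable, the "rel_N"
naming scheme, the in-memory wal list standing in for XLOG, and the record
name "pending_create" are all assumptions made for the example. One
deliberate simplification: when creation fails because the name already
exists, the sketch drops only the list entry and leaves the file alone, since
it was not created by us.

```python
import os
import threading


class PendingFileTable:
    """Toy stand-in for the shared-memory table of (xid, filename) pairs."""

    def __init__(self, directory):
        self.directory = directory
        self.entries = {}        # filename -> xid
        self.wal = []            # hypothetical stand-in for XLOG records
        self.lock = threading.Lock()
        self.counter = 0

    def _log_and_flush(self, record):
        # A real implementation would write an XLOG record and flush it here.
        self.wal.append(record)

    def create_file(self, xid):
        while True:
            with self.lock:
                # (1) generate a new filename
                self.counter += 1
                name = f"rel_{self.counter}"
                # (2) register and log *before* touching the filesystem
                if name in self.entries:
                    continue
                self.entries[name] = xid
                self._log_and_flush(("pending_create", xid, name))
            try:
                # (3) O_EXCL makes the create fail if the name is taken
                fd = os.open(os.path.join(self.directory, name),
                             os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                with self.lock:
                    del self.entries[name]
                continue
            os.close(fd)
            return name

    def _take(self, xids):
        # Remove and return all filenames registered under the given xids.
        with self.lock:
            names = [n for n, x in self.entries.items() if x in xids]
            for n in names:
                del self.entries[n]
        return names

    def abort(self, xids):
        # (4) forget the entries and delete the files
        for name in self._take(xids):
            os.unlink(os.path.join(self.directory, name))

    def commit(self, xids):
        # (5) forget the entries, keep the files
        self._take(xids)

    def prepare(self, xid):
        # (6) return the entries that would go into the 2PC state file
        with self.lock:
            return [n for n, x in self.entries.items() if x == xid]
```

Note that the log record in step (2) is written while the table lock is held,
so the order of entries in the log matches the order of updates to the table.
Whether that is acceptable with a real XLOG flush is a separate question.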

When creating a checkpoint, we include the global filelist in the checkpoint. We
might need some interlock to ensure that concurrent global filelist updates
don't get lost - but maybe doing things in the correct order is sufficient to
guarantee this.

During recovery, we track the fate of the files in a similar (but local) list.
.) We initialize our local tracking list with the one found in the latest
CHECKPOINT.
.) When we encounter a COMMIT record, we remove all files with xids matching
those in the COMMIT record without deleting them.
.) When we encounter a PREPARE record, we remove all files with matching xids,
and record them in the 2PC state file. They are deleted if the PREPARED
transaction is aborted.
.) When we encounter an ABORT record, we remove all files with matching xids
from the list, and delete them.
.) When we encounter a runtime CHECKPOINT, its list should match our tracking
list.
.) When we encounter a shutdown CHECKPOINT, we remove all files from our local
list that are not in the checkpoint's list, and delete those files.
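
The recovery rules above can be sketched as a small replay loop. Again this
is a toy model in Python, not PostgreSQL code: the function replay, the tuple
layout of the records, and the delete_file callback are assumptions made for
the example. The sketch also assumes that replaying a file-creation record
adds the file to the local list, which the rules above imply but do not state.

```python
def replay(checkpoint_files, records, delete_file):
    """Replay log records and track files whose fate is still undecided.

    checkpoint_files: {filename: xid} taken from the latest CHECKPOINT.
    records: iterable of tuples (hypothetical layout, see branches below).
    delete_file: callback invoked with each filename that must be removed.
    """
    pending = dict(checkpoint_files)   # local tracking list
    prepared = {}                      # xid -> filenames, for the 2PC state
    mismatches = []                    # runtime checkpoints that disagreed

    for rec in records:
        kind = rec[0]
        if kind == "pending_create":
            # Assumed rule: a logged creation adds the file to the list.
            _, xid, name = rec
            pending[name] = xid
        elif kind == "commit":
            # Drop the entries, keep the files.
            xids = set(rec[1])
            pending = {n: x for n, x in pending.items() if x not in xids}
        elif kind == "abort":
            # Drop the entries and delete the files.
            xids = set(rec[1])
            for name in [n for n, x in pending.items() if x in xids]:
                del pending[name]
                delete_file(name)
        elif kind == "prepare":
            # Move the entries into the 2PC state.
            xid = rec[1]
            names = [n for n, x in pending.items() if x == xid]
            for name in names:
                del pending[name]
            prepared[xid] = names
        elif kind == "checkpoint":
            _, mode, listed = rec
            if mode == "shutdown":
                # Anything we track that the checkpoint does not is a leak.
                for name in [n for n in pending if n not in listed]:
                    del pending[name]
                    delete_file(name)
            elif dict(listed) != pending:
                # A runtime checkpoint should match our list exactly.
                mismatches.append(listed)

    return pending, prepared, mismatches
```

Any files still in pending when the log ends belong to transactions that
neither committed nor aborted before the crash, so they are the candidates
for cleanup at the end of recovery.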

The XLOG flush in step (2) is pretty nasty, but I think any solution that
is guaranteed to prevent leaks will have to flush something to disk at that
point. The global table isn't too appealing either, because it
will limit how many concurrent transactions will be able to create files. It
could be replaced by some on-disk thing, though.

This solution sounds rather heavy-weight, but I thought I'd share the idea.

Back to work on lazy xid assignment now ;-)

greetings, Florian Pflug
