Re: [PATCHES] Cleaning up unreferenced table files

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Cleaning up unreferenced table files
Date: 2005-05-10 20:29:22
Message-ID: Pine.OSF.4.61.0505102211560.368341@kosh.hut.fi
Lists: pgsql-hackers pgsql-patches

On Sun, 8 May 2005, Tom Lane wrote:

> While your original patch is buggy, it's at least fixable and has
> localized, limited impact. I don't think these schemes are safe
> at all --- they put a great deal more weight on the semantics of
> the filesystem than I care to do.

I'm going to try this some more, because I feel that a scheme like this
that doesn't rely on scanning pg_class and the file system would in fact
be safer.

The key is to A) obey the "WAL first" rule, and B) remember information
about file creations over a checkpoint. The problem with my previous
suggestion was that it didn't reliably accomplish either :).

Right now we break the WAL rule because the file creation is recorded
only after the file has been created, and the record is not flushed.

The trivial way to fix that is to write and flush the xlog record before
actually creating the file. (For a more optimized way to do it, see the
end of this message.) Then we could trust that there aren't any files in
the data directory that don't have a corresponding record in WAL.
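
To illustrate the ordering, here is a toy model in Python, not backend
code. ToyWal, create_file_wal_first and unlogged_files are hypothetical
names, and a set of paths stands in for the data directory. It only shows
that if the record is flushed before the file is created, no file can
exist without a durable creation record.

```python
class ToyWal:
    """In-memory stand-in for the WAL; not PostgreSQL code."""

    def __init__(self):
        self.buffered = []  # records inserted but not yet durable
        self.durable = []   # records that would survive a crash

    def insert(self, record):
        self.buffered.append(record)

    def flush(self):
        # Move everything buffered so far to durable storage.
        self.durable.extend(self.buffered)
        self.buffered.clear()


def create_file_wal_first(wal, data_dir, xid, path):
    """Log and flush the creation record, and only then create the file."""
    wal.insert(("create", xid, path))
    wal.flush()
    data_dir.add(path)  # the "file" appears only after the flush


def unlogged_files(wal, data_dir):
    """Return files that exist without a durable creation record."""
    logged = {rec[2] for rec in wal.durable if rec[0] == "create"}
    return data_dir - logged
```

A crash between the flush and the file creation leaves a record without a
file, which is harmless. The current ordering can leave a file without a
record, which is the case we cannot clean up.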

But that's not enough. If a checkpoint occurs after the file is
created, but before the transaction ends, WAL replay starts from the
checkpoint and never sees the file creation record. That's why we need a
mechanism to carry the information over the checkpoint.

We could do that by extending the ForwardFsyncRequest function or by
creating something similar to that. When a backend writes the file
creation WAL record, it also sends a message to the bgwriter that says
"I'm xid 1234, and I have just created file foobar/1234" (while holding
CheckpointStartLock). Bgwriter keeps a list of xid/file pairs like it
keeps a list of pending fsync operations. On checkpoint, the checkpointer
scans the list and removes entries for transactions that have already
ended, and attaches the remaining list to the checkpoint record.
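
A rough sketch of the bookkeeping, again as a toy Python model.
PendingCreations and the is_running callback are hypothetical, and the
locking (CheckpointStartLock) and the actual message passing to the
bgwriter are not modeled at all.

```python
class PendingCreations:
    """Toy version of the xid/file list the bgwriter would keep."""

    def __init__(self):
        self.entries = []  # list of (xid, path) pairs, in arrival order

    def remember(self, xid, path):
        # Called when a backend reports "xid created path".
        self.entries.append((xid, path))

    def checkpoint(self, is_running):
        """Drop entries of ended transactions; return what is left.

        is_running(xid) is an assumed callback that says whether the
        transaction is still in progress. The returned list is what
        would be attached to the checkpoint record.
        """
        self.entries = [(xid, path) for xid, path in self.entries
                        if is_running(xid)]
        return list(self.entries)
```

For example, if xids 1234 and 1235 have each reported a file and only 1235
is still running at checkpoint time, the checkpoint record would carry
just the entry for 1235.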

WAL replay would start with the xid/file list in the checkpoint record,
and update it during the replay whenever a file creation or a transaction
commit/rollback record is seen. On a rollback record, files created by
that transaction are deleted. At the end of WAL replay, the files that are
left in the list belong to transactions that implicitly aborted, and can
be deleted.
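
The replay side could look roughly like the following toy model. The
record tuples and the replay function are made up for illustration, and
the function returns the list of files it would delete instead of touching
the file system.

```python
def replay(checkpoint_list, records):
    """Track file creations during replay; return the files to delete.

    checkpoint_list: (xid, path) pairs carried in the checkpoint record.
    records: WAL records after the checkpoint, as tuples
             ("create", xid, path), ("commit", xid) or ("abort", xid).
    """
    pending = {}  # xid -> paths created by that still-open transaction
    for xid, path in checkpoint_list:
        pending.setdefault(xid, []).append(path)

    to_delete = []
    for rec in records:
        kind, xid = rec[0], rec[1]
        if kind == "create":
            pending.setdefault(xid, []).append(rec[2])
        elif kind == "commit":
            pending.pop(xid, None)  # the files are now permanent
        elif kind == "abort":
            to_delete.extend(pending.pop(xid, []))

    # Whatever is left belongs to transactions that never logged a
    # commit or abort, i.e. they were implicitly aborted by the crash.
    for xid in sorted(pending):
        to_delete.extend(pending[xid])
    return to_delete
```

Note that a file created before the checkpoint by a transaction that was
still open at the crash is only found because the checkpoint list seeds
the pending set.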

If we don't want to extend the checkpoint record, a separate WAL record
works too.

Now, the more optimized way to do A:

Delay the actual file creation until it's first written to. The write
needs to be WAL logged anyway, so we would just piggyback on that.
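
As a toy illustration of the piggybacking (Python, hypothetical names, and
assuming the usual rule that WAL is flushed up to a page's LSN before the
page is written out): the file comes into existence as a side effect of
the first block write, by which time the record covering that write is
already durable.

```python
class ToyWal:
    """In-memory stand-in for the WAL; the LSN is just a record count."""

    def __init__(self):
        self.records = []
        self.flushed_lsn = 0  # records up to this LSN are durable

    def insert(self, record):
        self.records.append(record)
        return len(self.records)  # LSN of the record just inserted

    def flush_up_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)


def write_block(wal, files, path, block_no, data, page_lsn):
    """Write one block, creating the file lazily on first write.

    files maps path -> {block_no: data} and stands in for the data
    directory. Returns True if this call created the file.
    """
    wal.flush_up_to(page_lsn)  # WAL first: log must be durable
    created = path not in files
    files.setdefault(path, {})[block_no] = data
    return created
```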

Implemented this way, I don't think there would be a significant
performance hit from the scheme. We would create more ForwardFsyncRequest
traffic, but not much compared to the block fsync requests we have right
now.

BTW: If we allowed mdopen to create the file if it doesn't exist already,
would we need the current file creation xlog record for anything? (I'm
not suggesting we do that, just trying to get more insight.)

- Heikki
