It's past time to redo the smgr API

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: It's past time to redo the smgr API
Date: 2004-02-05 19:05:46
Message-ID: 5416.1076007946@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

A long time ago Vadim proposed that we should revise smgr.c's API so
that it does not depend on Relations (relcache entries); rather, only
a RelFileNode value should be needed to access a file in smgr and lower
levels. This would allow us to get rid of the concept of "blind
writes". As of CVS tip, with most writes being done by the bgwriter,
most writes are blind. Try tracing the bgwriter: every write involves
a file open and close, which has got to be hurting its performance.
The same has been true for awhile now of the background checkpointer.

I'm motivated to do something about this today because I'm reviewing the
PITR patch that J.R. Nield and Patrick Macdonald were working on, and
one of the things I don't care for is the kluges it adds to the smgr API
to work around lack of a Relation in some operations.

I propose the following revisions:

* smgr.c should internally keep a hashtable (indexed by RelFileNode)
with "struct SMgrRelation" entries for each relation that has recently
been accessed. File access information for the lower-level modules
(md.c, fd.c) would appear in this hashtable entry.

* smgropen takes a RelFileNode and returns a pointer to a hashtable
entry, which is used as the argument to all other smgr operations.
smgropen itself does not cause any actual I/O; thus it doesn't matter
if the file doesn't exist yet. smgropen followed by smgrcreate is the
way to create a new relation on disk. Without smgrcreate, file
existence will be checked at the first read or write attempt.

* smgrclose closes the md.c-level file and drops the hashtable entry.
Hashtable entries remain valid unless explicitly closed (thus, multiple
smgropens for the same file are legal).

* The rd_fd field of relcache nodes will be replaced by an "SMgrRelation *"
pointer, so that callers having a Relation reference can access the smgr
file easily. (In particular, this means that most calls to the bufmgr
will still pass Relations, and we don't need to propagate the API change
up to the bufmgr level.)

* The existing places in the bufmgr that do RelationNodeCacheGetRelation
followed by either regular or blind write would be replaced by smgropen
followed by smgrwrite. Since smgropen is just a hash table lookup, it
is no more expensive than the RelationNodeCacheGetRelation it replaces.
Note these places would *not* do smgrclose afterwards.

* Because we don't smgrclose after a write, it is possible to have
"dangling" smgr entries that aren't useful any more, as well as open
file descriptors underneath them. This isn't too big a deal on Unix
but it will be fatal for the Windows port, since it would prevent a
DROP TABLE if some other backend happens to have touched the table.
What I propose to do about this is:
1. In the bgwriter, at each checkpoint do "smgrcloseall" to close
all open files.
2. In regular backends, receipt of a relcache flush message will
result in smgrclose(), even if there is not a relcache entry
for the given relation ID. A global cache flush event (sinval
buffer overflow) causes smgrcloseall.
This ensures that open descriptors for a dead relation will not persist
past the next checkpoint. We had already agreed, I think, that the
best solution for Windows' problem with deleting open files is to retry
pending deletes at checkpoint. This smgr rewrite will not contribute
to actually making that happen, but it won't make things worse either.

* I'm going to get rid of the "which" argument that once selected a
particular smgr implementation (md.c vs mm.c). It's vestigial at
present and is misdefined anyway. If we were to resurrect the concept
of multiple storage managers, we'd need the selection to be determinable
from the RelFileNode value, rather than being a separate argument.
(When we add tablespaces, we could imagine associating smgrs with
tablespaces, but it would have to be the responsibility of smgr.c to
determine which smgr to use based on the tablespace ID.)

* We can remove the "dummy cache" support in relcache.c, as well as
RelationNodeCacheGetRelation() and the hashtable that supports it.

* The smgr hashtable entries could possibly be used for recording other
status; for instance keeping track of which relations need fsync in the
bgwriter. I'm also thinking of merging smgr.c's existing
list-of-rels-to-be-deleted into this data structure.

* AFAICS the only downside of not having a Relation available in smgr.c
and md.c is that error messages could only refer to the RelFileNode
numbers and not to the relation name. I'm not sure this is bad, since
in my experience what you want to know about such errors is the actual
disk filename, which RelFileNode tells you and relation name doesn't.
We could preserve the current behavior by passing the relation name to
smgropen when available, and saving the name in struct SMgrRelation.
But I'm inclined not to.

Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jan Wieck 2004-02-05 19:18:18 Re: Vacuum Delay feature
Previous Message Jan Wieck 2004-02-05 19:00:40 Re: Why has postmaster shutdown gotten so slow?