RE: OK, OK, Hiroshi's right: use a seperately-generated filename

From: "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: RE: OK, OK, Hiroshi's right: use a seperately-generated filename
Date: 2000-06-17 09:38:53
Message-ID: EKEJJICOHDIEMGPNIFIJIEAKCCAA.Inoue@tpf.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)hub(dot)org [mailto:pgsql-hackers-owner(at)hub(dot)org]On
> Behalf Of Tom Lane
>
> After further thought I think there's a lot of merit in Hiroshi's
> opinion that physical file names should not be tied to relation OID.
> If we use a separately generated value for the file name, we can
> solve a lot of problems pretty nicely by means of "table versioning".
>
> For example: VACUUM can't compact indexes at the moment, and what it
> does do (scan the index and delete unused entries) is really slow.
> The right thing to do is for it to generate an all-new index file,
> but how do we do that without creating a risk of leaving the index
> corrupted if we crash partway through? The answer is to build the
> new index in a new physical file. But how do we install the new
> file as the real index atomically, when it might span multiple
> segments? If the physical file name is decoupled from the relation's
> name *and* OID then there is no problem: the atomic event that makes
> the new file(s) the real table contents is the commit of the new
> pg_class row with the new value for the physical filename.
>
> Aside from possible improvements in VACUUM, this would let us do a
> robust implementation of CLUSTER, and we could do the "really change
> the table" variant of ALTER TABLE DROP COLUMN the same way if anyone
> wants to do it.
>

Yes,I've wondered how do we implement column_is_really_dropped
ALTER TABLE DROP COLUMN feature without this kind of mechanism.

> The only cost is that we need an additional column in pg_class to
> hold the physical file name. That's not so bad, especially when
> you remember that we'd surely need to add something to pg_class for
> tablespace support anyway.
>
> If we bite that bullet, then we could also do something to satisfy
> Bruce about having legible file names ;-). The column in pg_class
> could perfectly well be a string, not a pure number, and that means
> that we can throw in the relname (truncated to fit of course). So
> the thing would act a lot like the original-relname-plus-OID variant
> that's been discussed so far. (Original relname because ALTER TABLE
> RENAME would *not* change the physical file name. But we could
> think about a form of VACUUM that creates a whole new table by
> versioning, and that would presumably bring the physical name back
> in sync with the logical relname.)
>
> As Hiroshi pointed out, one of the best aspects of this approach
> is that the physical table layout policy doesn't have to be hard-wired
> into low-level file access routines. The low-level routines don't
> need to know much of anything about the format of the pathname,
> they just stuff in the right segment number and use the name. The
> layout policy need only be known to one single routine that generates
> the strings that go into pg_class. So it'd be really easy to change.
>

Ross's approach is fundamentally same though he is using relname+OID
naming rule. I've said his trial is most practical one.

> One thing we'd have to work out is that the critical system tables
> (eg, pg_class itself, as well as its indexes) would have to have
> predictable physical names.

The only limitation of the relation filename is the uniqueness.
So it doesn't introduce any inconsistency that system tables
have fixed name.
As for system relations it wouldn't be so bad because CLUSTER/
ALTER TABLE DROP COLUMN ... would be unnecessary(maybe).
But as for system indexes,it is preferable that VACUUM/REINDEX
could rebuild them safely. System indexes never shrink currently.

Regards.

Hiroshi Inoue
Inoue(at)tpf(dot)co(dot)jp

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message D'Arcy J.M. Cain 2000-06-17 10:50:57 Re: Changes to functions and triggers
Previous Message Hiroshi Inoue 2000-06-17 09:38:29 RE: Big 7.1 open items