From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: WORM and Read Only Tables (v0.1)
Date: 2007-12-11 11:12:46
Message-ID: 1197371566.4255.1329.camel@ebony.site
Lists: pgsql-hackers


Many applications need to archive data after it has been through the
initial flurry of reads and updates that follows its original
insertion. Currently there is no specific feature support to meet this
requirement, so I propose to add it for 8.4.

Use case: a VLDB with large amounts of (now) read-only data, plus some
that is still active. The data needs to stay accessible, but is itself
rarely touched, allowing storage costs to be minimised via a "storage
hierarchy" of progressively cheaper storage.

Features
- Read Only Tables
- Compressed Tablespaces
- Attaching tables
- Per-Tablespace Access Costing
- Performance Tuning

Read-Only Tables
----------------
Postgres supports the concept of freezing tuples, so they can live
forever within the database without needing further writes. Currently
there is no command that will guarantee that a table has been completely
frozen. This makes it difficult to reliably write data files to WORM
media for longer term archiving. (WORM means Write-Once, Read-Many).
It's also a pain having to VACUUM a large table again just because a
small number of rows need to be frozen.

So we need a DDL command that will ensure all tuples are frozen and then
mark the table as read-only. Ideally, we would like to do this in a way
that doesn't hold long full table locks, since we want the data to
remain accessible at all times.

So... VACUUM FREEZE table SET READ ONLY;

would be my first thought, but I'm guessing everybody will press me
towards supporting the more obvious

ALTER TABLE table SET READ ONLY;

This command will place a ShareLock (only) on the table, preventing
anybody from writing to the table while we freeze it. The ShareLock is
incompatible with any transaction that has written to the table, so when
we acquire the lock all writers to the table will have completed. We
then run the equivalent of a VACUUM FREEZE, which will be able to
freeze *all* rows in one pass (rather than all except the most recent).
On completion of the freeze pass we will update the pg_class entry to
show that the table is now read-only, emulating the way VACUUM updates
pg_class on completion.
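
To illustrate the intended locking behaviour, here is a hypothetical
two-session sketch (table contents and session labels are mine, not
part of the proposal):

-- Session 1: an in-flight writer holds RowExclusiveLock
BEGIN;
UPDATE mytable SET status = 'archived' WHERE id = 1;

-- Session 2: ShareLock conflicts with RowExclusiveLock, so this
-- blocks until session 1 commits or aborts
ALTER TABLE mytable SET READ ONLY;

-- Session 3: plain readers are unaffected throughout, since
-- AccessShareLock is compatible with ShareLock
SELECT * FROM mytable WHERE id = 2;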

This form of the ALTER TABLE command will need to be mangled so it can
only run outside of a transaction block and also so it takes only a
ShareLock rather than an AccessExclusiveLock.

Reversing the process is simpler, since we only have to turn off the
flag in pg_class:

ALTER TABLE table SET READ WRITE;

We may be able to do this without grabbing an AccessExclusiveLock,
though that isn't an important part of this implementation.

Read-only tables never need VACUUMing, so we would be able to make
autovacuum and explicit vacuum ignore them.

Read-only tables may not be written to, yet implicit or explicit
INSERT, UPDATE and DELETE privileges could still be held on the table.
Attempts to write to the table will result in a specific "read only
table cannot be modified" ERROR. This allows a table to be placed into
read-only mode for long periods of time and flipped back to read-write
if some updates are required. That is useful for various performance
reasons, see later. We can't use the privilege mechanism to prevent
writes since superusers bypass privilege checks. (Thoughts?)
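
A usage sketch of the proposed behaviour (the error wording is as
proposed above; table and data are illustrative):

INSERT INTO mytable VALUES (1, 'new row');
ERROR: read only table cannot be modified

-- flip the table back to read-write when updates are needed
ALTER TABLE mytable SET READ WRITE;
UPDATE mytable SET status = 'reopened' WHERE id = 1;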

Requests for tuple-level SHARE locks (e.g. FOR SHARE) against read-only
tables will be ignored, since such locks are effectively already held.
So we don't need to change the internals of the locking, nor edit the RI
code to remove the call to SHARE lock referenced tables. This would be
done during post-parse analysis.
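
For example, this would parse and run normally under the proposal, with
the lock request silently dropped during post-parse analysis:

-- FOR SHARE is a no-op on a read-only table: no row lock is
-- taken, since the rows can never change anyway
SELECT * FROM mytable WHERE id = 7 FOR SHARE;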

Tables can be copied to WORM media by using

ALTER TABLE table SET TABLESPACE tblspc;

This would also use a ShareLock rather than an AccessExclusiveLock,
piggy-backing off the work mentioned above.

Running SET TABLESPACE and SET READ ONLY at the same time might sound
like a good plan, but ISTM it will require two fairly different code
paths, so if we do it at all it will be a later addition.

Compressed Tablespaces
----------------------

Frequently with large data archives there is a requirement to reduce the
footprint of the data to allow longer term storage costs to be reduced.

For Insert-only data we might imagine we can reduce the size of tables
by removing unused tuple header information. Although that is possible,
repeated headers compress fairly well, so it seems easier to tackle the
problem directly by having compressed tables.

Using a streaming library like zlib, it will be easy to read/write data
files into a still-usable form but with much reduced size. Access to a
compressed table only makes sense as a SeqScan. That would be handled by
introducing tablespace-specific access costs, discussed below. Indexes
on compressed tables would still be allowed, but would hardly ever be
used.

Access would probably be via tablespace-specific storage managers. So
implement mdcompress.c alongside md.c in src/backend/storage/smgr. If
that implementation route were chosen, it would then allow the
compression option to be specified at tablespace level, so the command
would be:

CREATE TABLESPACE tablespacename LOCATION 'directory' [COMPRESS];

(ALTER TABLESPACE support is going to be much more complex, so leave
that alone for now)

So when we copy a read-only table to another tablespace the compression
would take place without additional syntax at ALTER TABLE level, i.e.
nothing new goes in tablecmds.c. Cool.

mdcompress.c seems fairly straightforward, though we would need to
think about how to implement smgr_nblocks(), since lseek-ing to get it
won't work: the compressed file is smaller than the actual decompressed
table. Perhaps we need an info file that contains something like an
index metapage from which we can read the number of blocks. Ideas?

In the first pass, I would only allow compressed read-only tables. In
time we might allow Inserts, Updates and Deletes, though the latter two
will never be very effective. So my initial approach does *not* allow
writing directly to a compressed table. A couple of reasons why that
isn't desirable/easy:

If we write data straight to the table then any aborted loads will be
written to the table as well, so VACUUMing the table would need to
re-compress it, which sounds horrible. Plus, hint status bits would
need to be set *after* the data was written. (Perhaps in conjunction
with a visibility map we can just forget that aspect of it.)

The other problem is that blocks would need to be written out of shared
buffers sequentially, which unfortunately we do not guarantee. We can
force that in the smgr layer by keeping track of the last written
blockid and then writing all cached blocks up to the currently requested
one, but that seems problematic.

I'd say if we want more, we do that in the next release, or at least the
next phase of development. So, for now, we would prevent the direct use
of CREATE TABLE on a COMPRESSED tablespace.
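
Something like the following would therefore be rejected (the error
message is hypothetical):

CREATE TABLE new_data (id int) TABLESPACE archive001;
ERROR: cannot create table on a compressed tablespace

-- instead: create and load the table elsewhere, set it read only,
-- then move it onto the compressed tablespace
ALTER TABLE new_data SET TABLESPACE archive001;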

I'm open to arguments that we don't need this at all because filesystem
utilities exist that do everything we need. Your experience with such
tools will be good to hear about in regard to this feature.

Attach
------

Writing tables on one system and then moving that data to other systems
is fairly common. If we supported read-only tables, you might then
consider how to publish new versions of the data to other people.

For now, we just want to consider how we will upgrade from one release
to another without needing to unload and reload potentially many
Terabytes of data. We can't delete the old data until the new data is
successfully loaded, so we would incur a huge temporary storage cost.
This could go very slowly if we use cheaper storage, plus reloading the
data means we have to freeze it all over again.

So we need a way of attaching the old tables to the new database. We
might call this binary upgrade, or we might be slightly less ambitious
and talk about just moving the old read-only data. That's all I want to
do at this stage.

I'm mentioning this here now to see what comes out in debate, and what
others are planning to work on in this area.

Per-Tablespace Access Costs
---------------------------

Access costs are currently set using seq_page_cost and
random_page_cost for the whole database.

Read-only tables and WORM media make it sensible to consider different
access costs for each tablespace. This would allow tablespaces on tape,
MAID, CD-ROM, or just fast and slow disks, etc. to be catered for.

For compressed filesystems, setting random_page_cost = ~100000 should do
the trick for most planning issues.

A new parameter, startup_page_cost, might also be introduced, so we can
allow for some kinds of media that have non-zero initial access times.
MAID storage takes ~1 sec to reach the first block, whether access is
sequential or random. Tape storage can take ~15 seconds with robots,
more with humans, and even more with outsourced offsite archival
companies.
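
A hypothetical sketch of how such settings might look (none of this
syntax exists today; parameter placement and values are illustrative):

-- compressed tablespace: heavily penalise random access so the
-- planner sticks to SeqScans
ALTER TABLESPACE archive001 SET (random_page_cost = 100000);

-- MAID tablespace: model the ~1 sec delay before the first block
ALTER TABLESPACE maid001 SET (startup_page_cost = 10000);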

Performance
-----------

There are a number of optimisations that are possible with read-only
tables. I'm not suggesting to write all of them as part of this initial
project, but it is useful to discuss them at the beginning.

- Removal of empty read-only tables from query plans (particularly
interesting for optimising partitioning; nice idea Greg)
- Column stats can hold min & max values for all columns, allowing min()
and max() aggregates to be answered from the catalog, plus those values
used for constraint exclusion during planning.
- pg_class.reltuples to be used directly for select count(*) (see the
sketch below)
- Table contents can be cached, allowing RI checks against small
read-only tables to operate in a similar manner to CHECK constraints
- Index-only plans would be able to trust the complete visibility of any
index entry, which would speed up RI checks against large read-only
tables

The first two are certainly related to advanced partitioning, so I would
be looking to work on those eventually.
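
As a sketch of the count(*) idea: for a read-only table the planner
could answer the query straight from the catalog, which you can already
query by hand today:

-- today's manual equivalent of the proposed shortcut
SELECT reltuples FROM pg_class WHERE relname = 'mytable';

-- with the optimisation, this would no longer need to scan the table
SELECT count(*) FROM mytable;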

Example
--------

When user tables get old we mark them read-only and copy them away to a
compressed tablespace, which we also create here to complete the
example.

VACUUM FREEZE mytable SET READ ONLY;
CREATE TABLESPACE archive001 LOCATION '/vldb_archive/arc001' COMPRESS;
ALTER TABLE mytable SET TABLESPACE archive001;

Implementation Schedule
-----------------------

Roughly in the order listed above.

I'm being partially sponsored for this work, but not fully. (Don't ask
who by, I'm under NDA). If you're a likely user of these features and
would like to sponsor me please drop me an email.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
