Re: Enum proposal / design

From: andrew(at)dunslane(dot)net
To: "Tom Dunstan" <pgsql(at)tomd(dot)cc>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enum proposal / design
Date: 2006-08-16 07:37:33
Message-ID: 1328.24.211.165.134.1155713853.squirrel@www.dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

We forgot to mention that we'll need to implement domains over enums and
arrays of enums too.

cheers

andrew

Tom Dunstan wrote:
> Hi guys
>
> Andrew and I got together and worked out a more detailed idea of how we
> want to add enums to the postgresql core. This follows on from his
> original enumkit prototype last year [1]. Here's a more formal proposal
> / design with what we came up with. Comments / criticism hereby solicited.
>
>
> How they will work (once created) is more or less the same as last time
> with the enumkit, with the exception of how they're created.
>
> Enum types will be created with a specialised version of the CREATE TYPE
> command thusly:
>
> CREATE TYPE rgb AS ENUM ('red', 'green', 'blue');
>
> They can then be used as column types, being input in quoted string form
> as with other user types:
>
> CREATE TABLE enumtest (col rgb);
> INSERT INTO enumtest VALUES ('red');
>
> Input is to be case sensitive, and ordering is to be in the definition
> order, not the collation order of the text values (ie 'red' < 'green' in
> the example above). See the original thread for more discussion and
> usage examples.
>
>
> The implementation will work as below. I've included something of a list
> of stuff to do as well.
>
> On disk, enums will occupy 4 bytes: the high 22 bits will be an enum
> identifier, with the bottom 10 bits being the enum value. This allows
> 1024 values for a given enum, and 2^22 different enum types, both of
> which should be heaps. The exact distribution of bits doesn't matter all
> that much, we just picked some that we were comfortable with.
>
> The identifier is required as output functions are not fed information
> about which exact type they are being asked to format (see below).
>
> The creation of a new pg_enum catalog is required. This will hold:
> - the type OID for the enum, from pg_type
> - the enum identifier for on disk storage
> - the enum values in definition order, as an array of text values
>
> The CREATE TYPE command will create a row in pg_type and a row in
> pg_enum. We will get a new enum id by scanning pg_enum and looking for
> the first unused value, rather than using a sequence, to make reuse of
> enum ids more predictable.
>
> Two new syscaches on pg_enum will be created to simplify lookup in the
> i/o functions: one indexed by type oid for the input function, and one
> indexed by enum id for the output function.
>
> All functions will be builtins; there will be no duplicate entries of
> them in pg_proc as was required for the enumkit.
>
> The i/o functions will both cache enum info in the same way that the
> domain and composite type i/o functions do, by attaching the data to the
> fcinfo->flinfo->fn_extra pointer. The input function will look up the
> enum data in the syscache using the type oid that it will be passed, and
> cache it in a hashtable or binary tree for easy repeated lookup. The
> output function will look up the enum data in the syscache using the
> enum id stripped from the high 22 bits of the on-disk value and cache
> the data as a straight array for easy access, with the enum value being
> used as a index into the array.
>
> The other functions will all work pretty much like they did in the
> enumkit, with comparison operators more or less treating the enum as its
> integer representation.
>
> The grammar will have to be extended to support the new CREATE TYPE
> syntax. This should not require making ENUM a reserved word. Likewise
> psql will be extended to learn the new grammar. There's probably a bit
> of work to do in DROP TYPE to make sure it deletes rows from pg_enum
> when appropriate.
>
> pg_dump must be taught how to dump enums properly.
>
> We'll need some regression tests, maybe including one in one of the PL
> testsuites to ensure that the io functions work happily when called from
> a non-standard direction.
>
> Documentation etc.
>
>
> General discussion:
>
> While we would really like to have had a 2 byte representation on disk
> (or even 1 for most cases), with the stored value being *just* the enum
> ordinal and not containing any type info about the enum type itself,
> this is difficult. Since the output function cleanup [2] [3], postgresql
> doesn't pass through the expected output type to output functions. This
> makes it difficult to tell the difference between e.g. the first value
> of the various enums, which would all have an integer representation of
> 0. We could have gone down the path of having the output function look
> up its expected type from the fcinfo->flinfo struct, as Martijn's tagged
> types do [4], but that would have required extra entries in pg_proc for
> every single enum. Alternatively we could have stored the full enum type
> oid on disk, but that would have blown out the on-disk representation to
> 5 or 6 bytes. The given approach of having a smaller enum id and the
> enum ordinal value stored in the 4 bytes seems a reasonable tradeoff
> given the current constraints.
>
> To preempt some questions (particularly some which came up in the
> enumkit discussion), here's a list of stuff which will *not* be
> implemented in the initial patch (and quite possibly never):
>
> - Support for ALTER TYPE to allow adding / modifying values etc. For
> the time being you'll just have to create a new type, do a bunch of
> ALTER TABLE commands, DROP the old type and rename the new one if you
> want the old name back.
>
> - Inline column enum declarations a la MySQL. While this feature might
> allow easier migration from MySQL, and we could theoretically do it by
> creating an anonymous type when creating the table, the cleanup when the
> column/table are dropped is a real problem, and pg_dump has to get a lot
> smarter. Given the ugliness of suppporting something similar with SERIAL
> columns [5], this is definitely not on the cards anytime soon.
>
> - Ordering by text value rather than the declaration order. If you
> want this, you really want a varchar domain instead. Or alternately you
> can order by e.g. colname::text if that does what you want. Doing
> something like that sounds suspiciously like ordering something for
> human consumption, though, which sounds like a really fast way to make
> your application difficult to localize. Anyway, if that's the only
> ordering you'll ever want, just define the values in alphabetical order.
> :)
>
> - Access to the internal integer representation. If you need to modify
> the values used or want to know what the integer is, use a lookup table
> instead. Enums are the wrong abstraction for you.
>
>
> Comments? Particularly on implementation strategy; the functionality was
> thrashed out pretty well last time around.
>
> Cheers
>
> Tom "unholy chimera" Dunstan
>
>
> [1] http://archives.postgresql.org/pgsql-hackers/2005-10/msg01243.php
> [2] http://archives.postgresql.org/pgsql-hackers/2005-04/msg00998.php
> [3] http://archives.postgresql.org/pgsql-hackers/2005-12/msg00454.php
> [4] http://svana.org/kleptog/pgsql/taggedtypes.html
> [5] http://archives.postgresql.org/pgsql-hackers/2006-04/msg01118.php
>
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jens-Wolfhard Schicke 2006-08-16 07:48:37 Re: insert/update/delete returning and rules
Previous Message Jie Zhang 2006-08-16 07:03:26 Re: [PATCHES] WIP: bitmap indexes