WIP: Range Types

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: WIP: Range Types
Date: 2011-01-04 07:29:45
Message-ID: 1294126185.18031.3395.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have been updating my work in progress here:

http://git.postgresql.org/gitweb?p=users/jdavis/postgres.git;a=log;h=refs/heads/rangetypes

Right now, it's not in a reviewable state, but those interested can
glance through the code.

Quick synopsis (for illustration purposes only; don't expect much from
the current code):

CREATE TYPE numrange
AS RANGE (SUBTYPE=numeric, SUBTYPE_CMP=numeric_cmp);

Much of the previous discussion seemed to focus on two issues:

1. Reconciling discrete ranges (like ranges of integers) and continuous
ranges (like ranges of numeric).

I liked Robert's suggestion here:

http://archives.postgresql.org/message-id/AANLkTiks_x93_k82b4f_ga634wCi0oeb9fTrUrF28EGM@mail.gmail.com

which says that the user can just define a "canonicalize" function that
will take a range as input (or perhaps the logical pieces of a range)
and put it into an appropriate canonical representation. For instance,
int4range_canonical might take (1,4] and turn it into [2,4]. This is
similar to a few other ideas, but Robert's idea seems to require the
least effort by the person defining the range type, because postgresql
can still handle representation.

It doesn't allow for all of the suggested features. In particular, it
would not allow "granules" to be specified for discrete ranges. But on
balance, it seems like this is the most conceptually simple and I think
it satisfies the primary use cases.

2. Representational issues. There are many possibilities here:
a. flags for inclusivity, start, and offset
b. flags for inclusivity, start, and end
c. if it's a discrete range, start and end only might suffice
d. if it's a discrete range, perhaps something involving "granules"

(a) might be interesting, and for some data types might be more compact,
but it introduces a new datatype that is distinct from the range's
subtype: the "difference type" (that is, for timestamps it's
"interval"). This approach seemed reasonable on paper, but it involves a
lot of extra complexity, and introduces some strange assumptions (using
an offset of "1 month" versus "30 days" can't be allowed).

(c) and (d) are rejected because they require different code paths for
discrete and continuous ranges.

I chose (b). This is the simplest. If desired, we could still allow the
user to specify their own serialize/deserialize functions, which can get
most of the benefits of the other ones anyway.

Other issues:

1. I plan to introduce an ANYRANGE type.

2. We need to use the subtype's IO functions, but those may not be
immutable. So, rather than create new IO functions for each range type,
I was thinking that I'd use just three (anyrange_i_in, anyrange_s_in,
and anyrange_v_in), and select the right one at definition time, based
on the subtype's IO functions' volatility. That seems like a bit of a
hack -- any better ideas?

3. Right now I allow user-defined parse/deparse functions to be
specified. In almost all cases, I would think that we want the text
format to be something like:
[ 2010-01-01, 2011-01-01 )
where the brackets denote inclusivity, and the left and right sides can
be optionally double-quoted. Is it even worth having these parse/deparse
functions, or should we just force the "obvious" format?

4. For the GiST penalty function, and perhaps some picksplit algorithms,
it might be nice to know the length of a range, or do some other kinds
of math. It introduces a lot of complexity to try to define math
functions for each subtype, and try to make sure they behave sanely. So
I was thinking that the user might need to specify a function that
converts the subtype into a float that approximates a value's position
in the total order. Any better ideas?

Overall:

I think this is one of the simpler designs. Conceptually, defining new
ranges of different granularity with ease sounds like a great idea --
but it introduces a lot of complexity (and generated a lot of different
opinions), so it was not included in this design. Similarly, I am
leaning away from lots of user-specified options unless there is a real
use case.

Any suggestions or comments welcome.

Regards,
Jeff Davis

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2011-01-04 08:29:55 Implementing pg_dump_sort.c topological sorting in sql/plpgsql/plperl?
Previous Message Itagaki Takahiro 2011-01-04 06:51:09 Re: system views for walsender activity