Range Types, discrete and/or continuous

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Range Types, discrete and/or continuous
Date: 2010-10-24 21:37:36
Message-ID: 1287956256.8516.792.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I'd like to open the discussion for Range Types again. This is a fairly
long email because several issues are intertwined, and I don't think
they can be neatly pulled apart. Previous discussions:

http://archives.postgresql.org/pgsql-hackers/2009-12/msg01162.php
http://archives.postgresql.org/pgsql-hackers/2010-10/msg01126.php

Wiki page:

http://wiki.postgresql.org/wiki/RangeTypes

Last development cycle, one of the questions that was unresolved was
whether to handle ranges like a discrete set (that is, [1,5) = [1,4] )
or continuous or both.

I think that discrete ranges are required. For instance, day range and
IP address ranges are two examples where treating them continuously
would clearly cause confusion. "Monday until Thursday" is the same as
"Monday through Wednesday," and it would be dangerous to treat them as
different values.

Timestamp ranges are a little trickier because the resolution is much
finer. There's a certain intuition for either representation:
* We could look at them as continuous, and say that the resolution is
merely an implementation detail.
* We could look at them as a discrete range, saying that it just
happens to have a very fine granularity.

I don't think intuition helps us much here, because different people
have different intuitions for different problems. The second seems more
mathematically sound to me, because the resolution will ultimately
affect the semantics at some point (whether it's seconds or days), so it
can't be purely an implementation detail.

For instance, let's say you have two ranges:

R1 = [ 2009-01-01 01:00:00, 2009-01-01 01:00:10 ]
R2 = [ 2009-01-01 01:00:00, 2009-01-01 01:00:10.000001 )

If we treat those as discrete, then R1 = R2, R1 contains R2, R2 contains
R1, and R2 - R1 = R1 - R2 = empty. However, if we treat those as
continuous, then we get a contradiction:

R2 contains R1
R1 does not contain R2
R2 - R1 = ( 2009-01-01 01:00:10, 2009-01-01 01:00:10.000001 ) = empty?

So clearly, the granularity can make a difference in the semantics. It
can only be treated as an implementation detail if we say "nobody will
ever have timestamps that close," which seems pretty flimsy. I think
everyone would agree that seconds are plenty large enough that conflicts
are likely, and there are only a million microseconds per second. It's
easy to imagine that such cases might appear if you are using interval
math or if you are collecting data from multiple inputs constantly
(network log data, sensor networks, etc).

A couple problems though: one is float types. Technically they can be
treated as discrete, but the granularity is variable, which may pose
some implementation awkwardness (for instance, trying to make use of
division or multiplication for granules won't work). I suppose we can
write off float4/8 because the users should expect some fuzziness, and
we can treat them as continuous. But if we want floating-point
timestamps to be supported by range types at all, do we treat those as
continuous while integer timestamps are discrete? Seems odd.

Also, there's a good case for continuous ranges for types like NUMERIC,
but still continuous ranges don't seem quite as important overall. One
way to make NUMERIC ranges discrete is by specifying a granule size; and
that has the added benefit that you can use that for other types as well
(like timestamp ranges where you only care about second granularity).
But then there are problems, like someone trying to say "float4 range
with granularity 0.1", which is asking for trouble. I suppose we could
say again that people using float4 are expecting some fuzziness.

The way I see it, there are three approaches to resolving these issues:

1. Support only discrete ranges, possibly allowing the specification of
the granule.
2. Support two different ways to specify range types: one for discrete
and one for continuous.
3. Combine the discrete and continuous interfaces by making the API a
higher level and forcing the type definition to make all of the
decisions. I think it's possible to do this in such a way that postgres
doesn't really care whether a type is continuous or discrete, based on
some ideas from Nathan Boley. This is "Approach 2" on wiki page.

#1 allows for #2 to happen later, so I don't think it makes sense to
choose #2 at this stage. However, #1 may preclude doing #3 later.

Right now I'm leaning toward #3. I like this approach quite a lot, and
it only has two real disadvantages:

* Harder to do sanity checks at the time of definition.
* No easy way to declare a range type that's slightly different from
an existing type. For instance, you can't just specify a different
granule with simple syntax, you'd need to handle that in the functions
used to define the range type.

I don't think either of those things are a major problem. Defining a new
range type requires a bit of thought to get right, and I don't think we
can come up with one interface that forces users to get it right. So,
it's reasonable to say that defining a new range type that's not
built-in should be an extension, rather than just a little DDL. The
important part of this feature is that postgres will know what types are
range types and how to manipulate them.

Also, if we start with #3, we might be able to add some syntax sugar on
top to make it easy to define new range types with simple DDL in some
simple situations.

Thoughts?

Regards,
Jeff Davis

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Smith 2010-10-24 21:39:12 Re: ask for review of MERGE
Previous Message Tom Lane 2010-10-24 21:34:16 Re: WIP: extensible enums