Postgresql Historical data store

From: ben(dot)carlyle(at)invensys(dot)com
To: general(at)humbug(dot)org(dot)au
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Postgresql Historical data store
Date: 2002-06-11 23:35:08
Message-ID: OFBFC2E035.E89BBA31-ON4A256BD5.007BD758@westsig.com.au

G'day,

As part of my work-related software development I've been asked to develop
a system related to my company's product that is capable of storing
time-ordered values for a set of inputs. The data is in the form of an
IEEE-style double and must be timestamped with at least millisecond
accuracy over a period of up to a year (or perhaps longer for other
projects). I can expect around 20 million changes per day across all the
inputs, although many inputs will not change very much at all. I also
need redundancy.

:)

I'm on top of most aspects of this piece of software; in fact, I've
written a prototype to model the kind of data distribution I want from a
software-guy point of view using a very simple but elegant SQL engine
called SQLite[1]. I'm now rethinking the entire design from the standpoint
of implementing it in a PostgreSQL database.

Since I'm not a database guru I thought I'd call upon the combined wisdom
of the HUMBUG oracle on two main points, which I'll discuss in some detail:

1) Sample-based queries
and
2) PostgreSQL built-in redundancy vs. an external redundancy solution

Sample-based queries:
To minimise space and to provide an accurate representation of the data
I'm storing I've so far opted to store an initial value and time for each
input at startup of the History system, then store transitions with
transition timestamps for each of the values. This is a reasonable
approach, but it makes a mockery of the basic SQL aggregate functions.
It's meaningless to calculate an average or standard deviation on a
dataset where each entry can cover an arbitrary period of time. In fact,
the appropriate way to calculate an average is something more like:
Avg = sum(entry value * entry period) / total time
I also need to be able to extract fixed-period samples or instantaneous
values for particular instants in time. There are more complications,
including a requirement to know when the data was "good" and being
collected correctly versus when it was down due to a failure of the source
or the history or some other factor... but I think I can deal with that
myself ;)

So based on data of the form[2]:

Time Value
1.01 1.005
1.02 1.006
1.09 2.504
2.34 2.459

I should be able to request a sample every 0.02 seconds from 1.01 and get
a result of
Time Value
1.01 1.005
1.03 1.006
1.05 1.006
1.07 1.006
1.09 2.504
1.11 2.504
... snip ...
2.35 2.459
and be able to do averages and other transforms on that result set, and
also be able to do the true average calculation, which would need the
time difference between each entry and its next entry:
Avg = ((1.02-1.01)*1.005 + (1.09-1.02)*1.006 + (2.34 - 1.09) * 2.504) /
(2.34 - 1.01)
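
For what it's worth, here's a rough sketch of how that true average might
be expressed in plain SQL, assuming a hypothetical table
history(ts double precision, val double precision) with one row per
transition (the names and types are just placeholders, not the real
design). Each row gets paired with the time of the next transition via a
correlated subquery:

  SELECT sum(h.val *
             ((SELECT min(h2.ts) FROM history h2 WHERE h2.ts > h.ts) - h.ts))
         / ((SELECT max(ts) FROM history) - (SELECT min(ts) FROM history))
         AS time_weighted_avg
    FROM history h
   WHERE EXISTS (SELECT 1 FROM history h2 WHERE h2.ts > h.ts);

On the example data this matches the hand calculation above; the WHERE
clause just drops the final transition, which has no following entry to
give it a period.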

So... coming to the actual question:
Are these kinds of transforms possible using simple SQL?
If not... Are these kinds of transforms possible using PL/pgSQL or by
coding some C into a shared library and introducing that as a function? Do
temporary tables have a part to play in this, and how does one ensure that
the temporary tables "live" only as long as the query takes to process?

Although I've read most of the documentation relating to these functions
on the PostgreSQL web site, it all seems to refer to functions that return
simple typed values instead of performing any kind of transformation on
the data set. Ideally I'd like any client application to be able to do a
select on a data set and be able to trigger the appropriate functions
internal to the server rather than bolting my own post-processing
application on to the top of the system.
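
To make that more concrete, here's a rough sketch of the kind of thing
I'm imagining for the fixed-period sampling: a PL/pgSQL function that
returns a set of rows rather than a single value. It assumes the same
hypothetical history(ts, val) table as above and reasonably recent
PL/pgSQL syntax, and I haven't run it against a real server:

  CREATE TYPE sample_point AS (ts double precision, val double precision);

  CREATE OR REPLACE FUNCTION sample_history(start_ts double precision,
                                            end_ts   double precision,
                                            step     double precision)
  RETURNS SETOF sample_point AS $$
  DECLARE
      t double precision := start_ts;
      r sample_point;
  BEGIN
      WHILE t <= end_ts LOOP
          -- the value in force at time t is the last transition at or
          -- before t
          SELECT val INTO r.val
            FROM history
           WHERE ts <= t
           ORDER BY ts DESC
           LIMIT 1;
          r.ts := t;
          RETURN NEXT r;
          t := t + step;
      END LOOP;
      RETURN;
  END;
  $$ LANGUAGE plpgsql;

The hope is that a client could then run ordinary aggregates over the
resampled set, e.g.

  SELECT avg(val) FROM sample_history(1.01, 2.35, 0.02);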

... anyway.

On to question 2:

PostgreSQL built-in redundancy vs. an external redundancy solution:

The requirements for redundancy are fairly simple.
1) Get data to disk as soon as possible
2) Get data to redundant copies as soon as possible
3) Redundant copies are completely redundant databases
4) Islanding is possible at any time for any duration
5) Failure is possible at any time for any duration
6) The data of communicating peers must become consistent as soon as
possible after failure or islanding restoration
7) Redundancy is n-way based on configuration parameters
8) Data can arrive in any order with any timestamp

In my prototype I've done all the redundancy in software. I have
a good infrastructure at my disposal for distributed assignment of control
and also for communication. I've based the redundancy on a stream concept
where the controlling history is the source of data at any time and writes
changes directly into a new file associated with the stream and labelled
with the host identifier and start time of the stream. A subscription
mechanism is provided so that all non-controller entities can subscribe to
the controller's version of the streams list and to any stream that they
do not have the complete version of.

In the case of host failure, control may transition and a new stream may
be started. The old stream will be left incomplete in all servers until
the failed host restores and the controller is able to determine whether
the dead host has information in its database that had not made it into
the other servers' stores, and vice versa. For this reason the controller
keeps an eye on all non-controllers to grab any information that they may
have that it does not, so that it can both store it locally and distribute
it out to other non-controllers.
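
As a rough illustration of the shape of the data (not a finished design),
the streams and their transitions might map onto PostgreSQL tables
something like the following, with every stored value tagged with the
stream it arrived on; the names are made up:

  CREATE TABLE stream (
      stream_id   serial PRIMARY KEY,
      host_id     text        NOT NULL,  -- host that controlled the stream
      started_at  timestamptz NOT NULL,  -- start time of the stream
      complete    boolean NOT NULL DEFAULT false  -- fully replicated here?
  );

  CREATE TABLE transition (
      stream_id   integer NOT NULL REFERENCES stream(stream_id),
      input_id    integer NOT NULL,           -- which input changed
      ts          timestamptz NOT NULL,
      val         double precision NOT NULL,
      good        boolean NOT NULL            -- data quality flag
  );

A non-controller could then spot incomplete streams to subscribe to with a
simple query on the complete flag.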

In the case of host islanding, multiple controllers may be assigned and
therefore multiple simultaneous streams can be started. When the islanding
is recovered from, all streams are synchronised so that both sets of data
are available to all databases. This also has SQL implications, because
for overlapping data from different streams we really want to perform an
intelligent merge:

Bad data mixed with bad data gives us bad data
Good data mixed with bad data gives us good data
Good data mixed with conflicting good data gives us bad data
Good data mixed with non-conflicting good data gives us good data

All of this work in the prototype basically happens on the query end of
the system.
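
Those merge rules look simple enough that they could probably live in the
database as a small function; a rough sketch, assuming each value carries
a boolean "good" quality flag (again, the names are placeholders):

  CREATE FUNCTION merge_good(boolean, double precision,
                             boolean, double precision)
  RETURNS boolean AS $$
      SELECT CASE
          WHEN $1 AND $3 THEN $2 = $4  -- two good sources: good only if
                                       -- they agree
          ELSE $1 OR $3                -- otherwise one good source is
                                       -- enough, two bad stay bad
      END;
  $$ LANGUAGE sql;

Bad mixed with bad falls out of the ELSE branch as false, which matches
the table above.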

If a host comes up and its database has been completely destroyed, it
will automatically download the list of streams and start synchronising
them, thereby ensuring consistency as soon as possible.

So the question is...

Can I achieve similar reliability using PostgreSQL, or am I better off
adapting my existing system to interface to PostgreSQL?

[1] SQLite is public domain software, and quite nice for small tasks :)
[2] Time is actually an absolute UTC time extracted from a timeval, and
probably stored in the database as a timestamp type
