
Postgresql Historical data store

From: ben(dot)carlyle(at)invensys(dot)com
To: general(at)humbug(dot)org(dot)au
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Postgresql Historical data store
Date: 2002-06-11 23:35:08
Message-ID: OFBFC2E035.E89BBA31-ON4A256BD5.007BD758@westsig.com.au
Lists: pgsql-novice
G'day,

As part of my work-related software development I've been asked to develop 
a system, related to my company's product, that is capable of storing 
time-ordered values for a set of inputs. The data is in the form of an 
IEEE-style double and must be timestamped with at least millisecond 
accuracy over a period of up to a year (or perhaps longer for other 
projects). I can expect around 20 million changes per day across all the 
inputs, although many inputs will not change very much at all. I also need 
redundancy.

:)

I'm on top of most aspects of this piece of software; in fact, I've written 
a prototype to model the kind of data distribution I want from a 
software-guy point of view using a very simple but elegant SQL engine 
called SQLite[1]. I'm now rethinking the entire design from the standpoint 
of implementing it in a PostgreSQL database.

Since I'm not a database guru, I thought I'd call upon the combined wisdom 
of the HUMBUG oracle on two main points, which I'll discuss in some detail:

1) Sample-based queries
and
2) PostgreSQL built-in redundancy vs. an external redundancy solution

Sample-based queries:
To minimise space and to provide an accurate representation of the data 
I'm storing, I've so far opted to store an initial value and time for each 
input at startup of the History system, then store transitions with 
transition timestamps for each of the values. This is a reasonable 
approach, but it makes a mockery of the basic SQL aggregate functions. 
It's meaningless to calculate an average or standard deviation on a 
dataset where each entry can have an arbitrary period of time associated 
with it. In fact, the appropriate way to calculate an average is something 
more like:
Avg = sum(entry value * entry period) / total time
I also need to be able to extract fixed-period samples or instantaneous 
values for particular instants in time. There are further complications, 
including a requirement to know when the data was "good" and being 
collected correctly versus when it was down due to a failure of the source 
or the history or some other factor... but I think I can deal with that 
myself ;)

So based on data of the form[2]:

Time    Value
1.01    1.005
1.02    1.006
1.09    2.504
2.34    2.459

I should be able to request a sample every 0.02 seconds from 1.01 and get 
a result of:
Time    Value
1.01    1.005
1.03    1.006
1.05    1.006
1.07    1.006
1.09    2.504
1.11    2.504
... snip ...
2.35    2.459
and be able to do averages and other transforms on that result set, and 
also be able to do the true average calculation, which needs the time 
difference between each entry and its next entry:
Avg = ((1.02-1.01)*1.005 + (1.09-1.02)*1.006 + (2.34 - 1.09) * 2.504) / 
(2.34 - 1.01)
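
I imagine the true average could be spelled out in SQL roughly as below, 
pairing each transition with the time of the following one via a 
correlated subquery. This is only a sketch against the placeholder table 
above (input 42 is an arbitrary example) and I haven't tested it:

SELECT sum(h.value * extract(epoch from
             (SELECT min(n.sample_time)
                FROM history n
               WHERE n.input_id = h.input_id
                 AND n.sample_time > h.sample_time) - h.sample_time))
     / extract(epoch from max(h.sample_time) - min(h.sample_time))
       AS weighted_avg
  FROM history h
 WHERE h.input_id = 42;

The last transition drops out of the numerator because it has no following 
transition, which matches the hand calculation above.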

So... coming to the actual question:
Are these kinds of transforms possible using simple SQL?
If not... are these kinds of transforms possible using PL/pgSQL, or by 
coding some C into a shared library and introducing that as a function? Do 
temporary tables have a part to play in this, and how does one ensure that 
the temporary tables "live" only as long as the query takes to process?

Although I've read most of the documentation relating to these functions 
on the PostgreSQL web site, it all seems to refer to functions that return 
simple typed values rather than performing any kind of transformation on 
the data set. Ideally I'd like any client application to be able to do a 
select on a data set and trigger the appropriate functions inside the 
server, rather than bolting my own post-processing application on top of 
the system.
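
To illustrate what I'm hoping for, I picture the fixed-period sampling as 
a set-returning PL/pgSQL function along the lines of the sketch below. 
The names are made up and I'm working from the documentation rather than 
experience, so please treat the syntax as approximate:

CREATE TYPE sample_point AS (sample_time timestamp, value float8);

CREATE FUNCTION resample(p_input integer,
                         p_start timestamp,
                         p_stop  timestamp,
                         p_step  interval)
RETURNS SETOF sample_point AS $$
DECLARE
    t timestamp := p_start;
    s sample_point;
BEGIN
    WHILE t <= p_stop LOOP
        -- the last transition at or before t gives the instantaneous value
        SELECT t, h.value INTO s
          FROM history h
         WHERE h.input_id = p_input
           AND h.sample_time <= t
         ORDER BY h.sample_time DESC
         LIMIT 1;
        IF FOUND THEN
            RETURN NEXT s;
        END IF;
        t := t + p_step;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- e.g. SELECT * FROM resample(42, '2002-06-01 00:00:01.010',
--                                 '2002-06-01 00:00:02.350', '0.02 seconds');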


... anyway.

On to question 2:

PostgreSQL built-in redundancy vs. an external redundancy solution:

The requirements for redundancy are fairly simple.
1) Get data to disk as soon as possible
2) Get data to redundant copies as soon as possible
3) Redundant copies are completely redundant databases
4) Islanding is possible at any time for any duration
5) Failure is possible at any time for any duration
6) The data of communicating peers must become consistent as soon as 
possible after recovery from failure or islanding
7) Redundancy is n-way based on configuration parameters
8) Data can arrive in any order with any timestamp

In my prototype I've done all of the redundancy in software. I have a 
good infrastructure at my disposal for distributed assignment of control 
and also for communication. I've based the redundancy on a stream concept 
where the controlling history is the source of data at any time and writes 
changes directly into a new file associated with the stream and labelled 
with the host identifier and start time of the stream. A subscription 
mechanism is provided so that all non-controller entities can subscribe to 
the controller's version of the streams list and to any stream that they 
do not have the complete version of.
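
In database terms I picture the stream bookkeeping as something like the 
following (hypothetical names again); each transition row would then carry 
a stream_id referencing this table:

CREATE TABLE stream (
    stream_id   serial    PRIMARY KEY,
    source_host text      NOT NULL,  -- controller that opened the stream
    started_at  timestamp NOT NULL,  -- start time the stream is labelled with
    complete    boolean   NOT NULL DEFAULT false  -- true once this copy has
                                                  -- the whole stream
);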

In the case of host failure, control may transition and a new stream may 
be started. The old stream will be left incomplete in all servers until 
the failed host restores and the controller is able to determine whether 
the dead host has information in its database that had not made it into 
the other servers' stores, and vice versa. For this reason the controller 
keeps an eye on all non-controllers to grab any information that they have 
and it does not, so that it can both store it locally and distribute it 
out to the other non-controllers.

In the case of host islanding, multiple controllers may be assigned and 
therefore multiple simultaneous streams may be started. When the islanding 
is recovered from, all streams are synchronised so that both sets of data 
are available to all databases. This also has SQL implications, because 
for overlapping data from different streams we really want to perform an 
intelligent merge:

Bad data mixed with bad data gives us bad data
Good data mixed with bad data gives us good data
Good data mixed with conflicting good data gives us bad data
Good data mixed with non-conflicting good data gives us good data

All of this work in the prototype basically happens on the query end of 
the system.
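
In SQL terms I imagine the merge of two overlapping rows looking roughly 
like the sketch below, assuming each row carries a boolean "good" quality 
flag as well as the stream_id (both hypothetical columns on the history 
table I sketched earlier); again, untested:

SELECT a.input_id,
       a.sample_time,
       CASE WHEN a.good THEN a.value ELSE b.value END AS value,
       CASE
         WHEN NOT a.good AND NOT b.good THEN false  -- bad + bad -> bad
         WHEN a.good <> b.good          THEN true   -- good + bad -> good
         WHEN a.value = b.value         THEN true   -- good + agreeing good -> good
         ELSE false                                 -- good + conflicting good -> bad
       END AS good
  FROM history a
  JOIN history b ON a.input_id    = b.input_id
                AND a.sample_time = b.sample_time
                AND a.stream_id   < b.stream_id;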

If a host comes up and its database has been completely destroyed, it 
will automatically download the list of streams and start synchronising 
them, thereby ensuring consistency as soon as possible.

So the question is...

Can I achieve similar reliability using PostgreSQL, or am I better off 
adapting my existing system to interface to PostgreSQL?

[1] SQLite is public domain software, and quite nice for small tasks :)
[2] Time is actually an absolute UTC time extracted from a timeval, and 
probably stored in the database as a timestamp type

