Re: best db schema for time series data?

From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: best db schema for time series data?
Date: 2010-11-19 17:13:58
Message-ID: 87mxp5p76x.fsf@cbbrowne.afilias-int.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

vindex+lists-pgsql-performance(at)apartia(dot)org (Louis-David Mitterrand)
writes:
> On Tue, Nov 16, 2010 at 11:35:24AM -0500, Chris Browne wrote:
>> vindex+lists-pgsql-performance(at)apartia(dot)org (Louis-David Mitterrand)
>> writes:
>> > I have to collect lots of prices from web sites and keep track of their
>> > changes. What is the best option?
>> >
>> > 1) one 'price' row per price change:
>> >
>> > create table price (
>> > id_price primary key,
>> > id_product integer references product,
>> > price integer
>> > );
>> >
>> > 2) a single 'price' row containing all the changes:
>> >
>> > create table price (
>> > id_price primary key,
>> > id_product integer references product,
>> > price integer[] -- prices are 'pushed' on this array as they change
>> > );
>> >
>> > Which is bound to give the best performance, knowing I will often need
>> > to access the latest and next-to-latest prices?
>>
>> I'd definitely bias towards #1, but with a bit of a change...
>>
>> create table product (
>> id_product serial primary key
>> );
>>
>> create table price (
>> id_product integer references product,
>> as_at timestamptz default now(),
>> primary key (id_product, as_at),
>> price integer
>> );
>
> Hi Chris,
>
> So an "id_price serial" on the price table is not necessary in your
> opinion? I am using "order by id_price limit X" or "max(id_price)" to
> get at the most recent prices.

It (id_price) is an extra piece of information that doesn't reveal an
important fact, namely when the price was added.

I'm uncomfortable with adding data that doesn't provide much more
information, and it troubles me when people put a lot of interpretation
into the meanings of SERIAL columns.

I'd like to set up some schemas (for experiment, if not necessarily to
get deployed to production) where I'd use DCE UUID values rather than
sequences, so that people wouldn't make the error of imagining meanings
in the values that aren't really there.

And I suppose that there lies a way to think about it... If you used
UUIDs rather than SERIAL, how would your application break?

And of the ways in which it would break, which of those are errors that
fall from:

a) Ignorant usage, assuming order that isn't really there? (e.g. - a
SERIAL might capture some order information, but UUID won't!)

b) Inadequate data capture, where you're using the implicit data
collection from SERIAL to capture, poorly, information that should
be expressly captured?

When I added the timestamp to the "price" table, that's intended to
address b), capturing the time that the price was added.

>> The query to get the last 5 prices for a product should be
>> splendidly efficient:
>>
>> select price, as_at from price
>> where id_product = 17
>> order by as_at desc limit 5;
>>
>> (That'll use the PK index perfectly nicely.)
>>
>> If you needed higher performance, for "latest price," then I'd add a
>> secondary table, and use triggers to copy latest price into place:
>>
>> create table latest_prices (
>> id_product integer primary key references product,
>> price integer
>> );
>
> I did the same thing with a 'price_dispatch' trigger and partitioned
> tables (inheritance). It's definitely needed when the price database
> grow into the millions.
>
> Thanks,

The conversations are always interesting! Cheers!
--
output = ("cbbrowne" "@" "gmail.com")
http://www3.sympatico.ca/cbbrowne/x.html
FLORIDA: If you think we can't vote, wait till you see us drive.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Klemme 2010-11-20 00:16:23 Re: best db schema for time series data?
Previous Message Jignesh Shah 2010-11-19 14:52:47 Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?