Re: Performance issues with large amounts of time-series data

From: Hrishikesh (हृषीकेश मेहेंदळे) <hashinclude(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance issues with large amounts of time-series data
Date: 2009-08-26 18:39:40
Message-ID: a7c00d4b0908261139l7a12990bx7ee4f6324723e02b@mail.gmail.com
Lists: pgsql-performance

Hi Tom,

Thanks for your quick response.

2009/8/26 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> <hashinclude(at)gmail(dot)com> writes:
> > In my timing tests, the performance of PG is quite a lot worse than the
> > equivalent BerkeleyDB implementation.
>
> Are you actually comparing apples to apples?  I don't recall that BDB
> has any built-in aggregation functionality.  It looks to me like you've
> moved some work out of the client into the database.

I'm measuring end-to-end time, which includes the in-code aggregation
with BDB (post DB fetch) and the in-query aggregation in PG.

> > 1. Is there anything I can do to speed up performance for the queries?
>
> Do the data columns have to be bigint, or would int be enough to hold
> the expected range?  SUM(bigint) is a *lot* slower than SUM(int),
> because the former has to use "numeric" arithmetic whereas the latter
> can sum in bigint.  If you want to keep the data on-disk as bigint,
> but you know the particular values being summed here are not that
> big, you could cast in the query (SUM(data_1::int) etc).

For the 300-sec tables I can probably drop it to an integer, but the
3600- and 86400-sec tables (1 hr, 1 day) will probably need to stay
BIGINT. However, given that I'm on a 64-bit platform (sorry if I
didn't mention that earlier), does it make that much of a difference?
And how does a float ("REAL") compare in terms of SUM()s?

> I'm also wondering if you've done something to force indexscans to be
> used.  If I'm interpreting things correctly, some of these scans are
> traversing all/most of a partition and would be better off as seqscans.

One thing I noticed is that if I specify which devices I want the data
for (specifically, all of them, listed out as DEVICE IN (1,2,3,4,5...)
in the WHERE clause), PG uses a Bitmap Heap Scan, while if I don't
specify the list (which still gives me data for all the devices), PG
uses a Sequential Scan. (I might have missed the DEVICE IN (...) in my
earlier query.) However, more often than not, the query _will_ be of
the form DEVICE IN (...). If I actually execute the queries (on the
psql command line), their runtimes are about the same (15s vs. 16s).
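
For reference, the two shapes I compared were roughly the following
(placeholder table/column names again, and a shortened device list):

  -- Explicit device list: planner chose a Bitmap Heap Scan
  EXPLAIN ANALYZE
  SELECT device_id, SUM(data_1)
    FROM sample_300
   WHERE device_id IN (1, 2, 3, 4, 5)
     AND ts >= '2009-08-25' AND ts < '2009-08-26'
   GROUP BY device_id;

  -- Same query without the IN list: planner chose a Seq Scan
  EXPLAIN ANALYZE
  SELECT device_id, SUM(data_1)
    FROM sample_300
   WHERE ts >= '2009-08-25' AND ts < '2009-08-26'
   GROUP BY device_id;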

> >      shared_buffers = 128MB
>
> This is really quite lame for the size of machine and database you've
> got.  Consider knocking it up to 1GB or so.

OK, I've bumped it up to 1 GB. However, that doesn't seem to make a
huge difference (unless I need to do the same on libpqxx's connection
object too).
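
(The change itself was just this line in postgresql.conf, followed by
a server restart:

  shared_buffers = 1GB    # was 128MB
)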

Cheers,
Hrishi
