Re: Using Postgres to store high volume streams of sensor readings

From: "Ciprian Dorin Craciun" <ciprian(dot)craciun(at)gmail(dot)com>
To: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Using Postgres to store high volume streams of sensor readings
Date: 2008-11-21 13:26:00
Message-ID: 8e04b5820811210526s44767e8fte8445bbbda95ec59@mail.gmail.com
Lists: pgsql-general

On Fri, Nov 21, 2008 at 3:18 PM, Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com> wrote:
> you'll have to provide us with some sort of test-case to get some answers,
> please. (set of scripts, queries, etc).

Below is the content of my original post. In it I describe
exactly the way the benchmark was conducted.

In short, the data is inserted using COPY sds_benchmark_data
FROM STDIN, in batches of 500 thousand data points.
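
For clarity, each such batch is logically equivalent to the psql
session sketched below (the sample rows are only illustrative
placeholders; the real batches contain 500 thousand tab-separated rows,
generated with the rules described further down, and are streamed
through libpq over a local UNIX socket):

    copy sds_benchmark_data (client, sensor, timestamp, value) from stdin;
    4217    3       1       730
    802     7       2       1215
    9530    0       3       488
    \.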

I'll also paste the important part here:

> * Postgres version: 8.3.3;
>
> * database schema:
>> create table sds_benchmark_data (
>> client int4 not null,
>> sensor int4 not null,
>> timestamp int4 not null,
>> value int4
>> );
>>
>> alter table sds_benchmark_data add primary key (client, sensor, timestamp);
>>
>> create index sds_benchmark_data__client_sensor__index on sds_benchmark_data (client, sensor);
>
> * postgresql.conf (default values; only the settings that were
> changed are listed below):
>> max_connections = 20
>> shared_buffers = 24MB
>> work_mem = 64MB
>> maintenance_work_mem = 256MB
>> fsync = off
>> checkpoint_segments = 32
>> effective_cache_size = 1024MB
>
> * inserts are done like this:
> * generated 100 million readings by using the following rule:
> * client is randomly chosen between 0 and 10 thousand;
> * sensor is randomly chosen between 0 and 10;
> * the timestamp is always increasing by one;
> * the inserts are done in batches of 500 thousand rows (I've
> also tried 5, 25, 50 and 100 thousand without a big impact);
> * the batch inserts are done with COPY sds_benchmark_data
> FROM STDIN through libpq (using UNIX (local) sockets);
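
And, for reference, the read queries I benchmark against this table
are roughly of the following shape (the client, sensor and interval
values here are just placeholders; the actual query patterns are
described in my original post below):

    select min(value), max(value), avg(value)
    from sds_benchmark_data
    where client = 1234
      and sensor = 5
      and timestamp between 1000000 and 2000000;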

Unfortunately I don't know what more information to give...

Thanks,
Ciprian Dorin Craciun.

On Fri, Nov 21, 2008 at 2:50 PM, Ciprian Dorin Craciun
<ciprian(dot)craciun(at)gmail(dot)com> wrote:
> Hello all!
>
> I would like to ask for some advice about the following problem
> (related to the Dehems project: http://www.dehems.eu/ ):
> * there are some clients; (the clients are in fact households;)
> * each client has a number of sensors (about 10), and not all the
> clients have the same sensors; also sensors might appear and disappear
> dynamically; (the sensors are appliances;)
> * for each client and each sensor a reading is produced (about
> every 6 seconds); (the values could be power consumptions;)
> * I would like to store the following data: (client, sensor,
> timestamp, value);
> * the usual queries are:
> * for a given client (and sensor), and time interval, I need
> the min, max, and avg of the values;
> * for a given time interval (and sensor), I need min, max, and
> avg of the values;
> * other statistics;
>
> Currently I'm benchmarking the following storage solutions for this:
> * Hypertable (http://www.hypertable.org/) -- which has a good insert
> rate (about 250k inserts / s), but a slow read rate (about 150k reads /
> s); (the aggregates are computed manually, as Hypertable does not
> support queries other than scanning (in fact min and max are easy,
> being the first / last key in the ordered set, but avg must be done
> by a sequential scan);)
> * BerkeleyDB -- quite OK insert rate (about 50k inserts / s), but a
> fabulous read rate (about 2M reads / s); (the same issue with
> aggregates;)
> * Postgres -- which behaves quite poorly (see below)...
> * MySQL -- next to be tested;
>
> So what can I do / how could I optimize the use of Postgres for this usage?
>
> (I'm aware that there could be optimizations for this problem
> (like computing the aggregates in memory and storing only these
> aggregates every 10 minutes, or at some other interval), but I want
> to keep the full history (for data mining tasks, for example);)
>
> I would also like to present how the Postgres benchmark was implemented:
>
> * test machine: Linux (Ubuntu 8.04 x64), IBM x3750, 2 x 500 GB
> SCSI disks in RAID 0, 8 GB RAM, 2 x dual-core Intel Xeon 5160 (3 GHz, 4 MB L2);
> * Postgres version: 8.3.3;
>
> * database schema:
>> create table sds_benchmark_data (
>> client int4 not null,
>> sensor int4 not null,
>> timestamp int4 not null,
>> value int4
>> );
>>
>> alter table sds_benchmark_data add primary key (client, sensor, timestamp);
>>
>> create index sds_benchmark_data__client_sensor__index on sds_benchmark_data (client, sensor);
>
> * postgresql.conf (default values; only the settings that were
> changed are listed below):
>> max_connections = 20
>> shared_buffers = 24MB
>> work_mem = 64MB
>> maintenance_work_mem = 256MB
>> fsync = off
>> checkpoint_segments = 32
>> effective_cache_size = 1024MB
>
> * inserts are done like this:
> * generated 100 million readings by using the following rule:
> * client is randomly chosen between 0 and 10 thousand;
> * sensor is randomly chosen between 0 and 10;
> * the timestamp is always increasing by one;
> * the inserts are done in batches of 500 thousand rows (I've
> also tried 5, 25, 50 and 100 thousand without a big impact);
> * the batch inserts are done with COPY sds_benchmark_data
> FROM STDIN through libpq (using UNIX (local) sockets);
>
> What have I observed / tried:
> * I've tested without the primary key and the index, and the
> results were the best for inserts (600k inserts / s), but the reads
> were extremely slow (due to the lack of indexing);
> * with only the index (or only the primary key) the insert rate is
> good at the start (for the first 2 million readings), but then drops
> to about 200 inserts / s;
>
> So could someone point out where I'm wrong, or what I can do to
> optimize Postgres for this particular task?
>
> Thanks for your help,
> Ciprian Dorin Craciun.
>
> P.S.: I want to publish the benchmark results once they are
> done, and I would like to squeeze as much performance out of Postgres
> as possible.
