thank you for all your comments and recommendations.
I'm aware that the conditions for this benchmark are not ideal, mostly
due to the lack of time to prepare it. We will also need an additional
benchmark on a less powerful - more realistic - server to better
understand the scability of our application.
Our application is based on java and is generating dynamic reports from
log files content. Dynamic means here that a repor will be calculated
from the postgres data the first time it is requested (it will then be
cached). Java is used to drive the data preparation and to
handle/generate the reports requests.
This is much more an OLAP system then an OLTP, at least for our
1) parsing the log files with a heavy use of perl (regular expressions)
to generate csv files. Prepared statements also maintain reference
tables in the DB. Postgres performance is not an issue for this first
2) loading the csv files with COPY. As around 70% of the data to load
come in a single daily table, we don't allow concurrent jobs for this
step. We have between a few and a few hundreds files to load into a
single table; they are processed one after the other. A primary key is
always defined; for the case when the required indexes are alreay built
and when the new data are above a given size, we are using a "shadow"
table instead (without the indexes) , build the index after the import
and then replace the live table with the shadow one.
For example, we a have a table of 13 GB + 11 GB indexes (5 pieces).
a) is there an "ideal" size to consider for our csv files (100 x 10
MB or better 1 x 1GB ?)
b) maintenance_work_mem: I'll use around 1 GB as recommended by
3) Data agggregation. This is the heaviest part for Postgres. On our
current system some queries need above one hour, with phases of around
100% cpu use, alterning with times of heavy i/o load when temporary
results are written/read to the plate (pgsql_tmp). During the
aggregation, other postgres activities are low (at least should be) as
this should take place at night. Currently we have a locking mechanism
to avoid having more than one of such queries running concurently. This
may be to strict for the benchmark server but better reflect our current
Performances : Here we should favorise a single huge transaction and
consider a low probability to have another transaction requiring large
sort space. Considering this, is it reasonable to define work_mem being
3GB (I guess I should raise this parameter dynamically before running
the aggregation queries)
4) Queries (report generation)
We have only few requests which are not satisfying while requiring large
sort operations. The data are structured in different aggregation levels
(minutes, hours, days) with logical time based partitions in oder to
limit the data size to compute for a given report. Moreover we can scale
our infrastrucure while using different or dedicated Postgres servers
for different customers. Smaller customers may share a same instance,
each of them having its own schema (The lock mechanism for large
aggregations apply to a whole Postgres instance, not to a single
customer) . The benchmark will help us to plan such distribution.
During the benchmark, we will probably not have more than 50 not idle
connections simultaneously. It is a bit too early for us to fine tune
this part. The benchmark will mainly focus on the steps 1 to 3
During the benchmark, the Db will reach a size of about 400 GB,
simulating 3 different customers, also with data quite equally splitted
in 3 scheemas.
I will post our configuration(s) later on.
Thanks again for all your valuable input.
In response to
pgsql-performance by date
|Next:||From: Campbell, Lance||Date: 2007-07-24 17:06:13|
|Subject: Table Statistics with pgAdmin III|
|Previous:||From: Jonathan Gray||Date: 2007-07-24 09:50:18|
|Subject: Re: Query performance issue|