Re: general questions postgresql performance config

From: Jayadevan M <Jayadevan(dot)Maymala(at)ibsplc(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: general questions postgresql performance config
Date: 2010-01-27 06:04:37
Message-ID: OF5E3D214E.C1992ACF-ON652576B8.00201B0D-652576B8.0021B2C8@ibsplc.com
Lists: pgsql-general

Hi,
Regarding Pentaho - please keep in mind that Pentaho needs a significant
amount of memory. We had a lot of issues with Pentaho crashing with Java
out-of-memory errors. If you are using a 64-bit machine, you may be able to
give it sufficient RAM and keep it happy. If all you have is one 4 GB
machine to run both PostgreSQL and the ETL tool, I have my doubts. It
depends on the volume of data - how many GBs rather than how many records.

Pentaho recently added a PostgreSQL bulk loader as an experimental
component; you could try that out. Talend can generate Java or Perl
components and was faster than Pentaho in our case, but since the Talend
community edition did not provide a shared development environment, we
opted for Pentaho.

If there are not a lot of complex 'transformations', you should be able to
manage fine with shell scripts.
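Something along these lines, for example - the column offsets, file names
and the target table (txn) are invented here for illustration, so treat it
as a sketch rather than your actual record layout:

    #!/bin/sh
    # Cut the fixed-width fields out with gawk and emit tab-separated text.
    # The offsets, file names and the txn table are made-up placeholders.
    gawk '{
            acct   = substr($0,  1, 10)
            txdate = substr($0, 11,  8)   # YYYYMMDD parses fine into a date column
            amount = substr($0, 19, 12)
            gsub(/ /, "", acct); gsub(/ /, "", amount)
            print acct "\t" txdate "\t" amount
          }' input.dat > txn.tsv

    # \copy runs client-side, so the file does not have to sit on the DB server.
    # COPY's default text format is tab-separated, matching the output above.
    psql -d mydb -c "\copy txn from 'txn.tsv'"

For plain loading, COPY (or \copy from psql) is generally the fastest route
into PostgreSQL.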
Jayadevan

From: Dino Vliet <dino_vliet(at)yahoo(dot)com>
To: pgsql-general(at)postgresql(dot)org
Date: 01/25/2010 09:57 PM
Subject: [GENERAL] general questions postgresql performance config
Sent by: pgsql-general-owner(at)postgresql(dot)org

Dear postgresql people,

Introduction
Today I've been given the task of proceeding with my plan to use PostgreSQL
and other open source technologies to demonstrate to the management of my
department the usefulness and the "cost savings" potential that lies
ahead. You can guess how excited I am right now. However, I need to plan
and execute at the highest level because I really want to show results.
I'm employed in financial services.

Context of the problem
Given 25 million input records, transform and load 10 million of them into
a single-table DB2 database that already contains 120 million records (the
whole history).

The current process runs on the MVS mainframe, with SAS used to process the
records (ETL-like operations). The records of the last two months (so 20
million records) are also stored in a single SAS dataset, where users can
access them through SAS running on their Windows PCs. With SAS on their PCs
they can also analyse the historical records in the DB2 table on the
mainframe.

These users are not tech-savvy, so this access method is not very
productive for them, but because the data is highly valued they use it
without complaining too much.

Currently it takes 5 to 6 hours before everything is finished.

Proof of concept
I want to showcase that a processing pipeline like:

input --> Talend/Pentaho Kettle for ETL --> PostgreSQL --> Pentaho Report Designer

is feasible while staying within the current 5-6 hours of processing and
loading time.

Input: flat files, position-based
ETL: Pentaho Kettle or Talend to process these files
DBMS: PostgreSQL 8 (on Debian, openSUSE, or FreeBSD)
Reporting: Pentaho report wizard

Hardware
AMD AM2 single-core CPU with 4 GB RAM
Two mirrored SATA II disks (RAID 1)

Now that I have introduced my situation, I hope this list can give me some
tips, advice, examples, and pitfalls to watch out for regarding these
requirements.

Questions
1) Although this is not exactly rocket science, the sheer volume of the
data makes it a hard task. Do you think my "solution" is
viable/achievable?

2) What kind of OS would you choose for the setup I have proposed? I
prefer FreeBSD with UFS2 as the filesystem, but I guess Debian, openSUSE,
or Ubuntu Server with ext3 would all be very good candidates too?

3) Would you opt for the ETL tools I mentioned (Pentaho and Talend) or
just rely on Unix/Linux tools like gawk, sed, and Perl? I'm familiar with
gawk. The ETL tools require Java, so I would have to configure PostgreSQL
not to use all the available RAM, otherwise I risk Java out-of-memory
errors. With that said, it would probably be best to first configure my
server for the ETL processing and only afterwards configure it for
database usage, roughly as in the sketch below.
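
Roughly what I have in mind is something like this - the jar name, heap
size, table and file names are just placeholders, and how the -Xmx heap
ceiling gets wired into Kettle's or Talend's launcher scripts differs per
version:

    # 1) Run the ETL step with an explicit JVM heap ceiling (-Xmx is the
    #    standard JVM flag) so it cannot fight PostgreSQL for memory.
    #    etl_job.jar, the 2 GB figure and the file names are placeholders.
    java -Xmx2048m -jar etl_job.jar input.dat transformed.tsv

    # 2) Only afterwards run the heavy database work, when the RAM is free.
    psql -d mydb -c "\copy target_table from 'transformed.tsv'"
    psql -d mydb -c "VACUUM ANALYZE target_table;"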

4) What values would you advise for the various postgresql.conf settings
that can impact performance, like shared_buffers, temp_buffers, sort_mem,
etc.? Or is this more of an "art" where I change a setting, restart the
db server, analyze the queries, and iterate until I find optimal values?
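
For concreteness, the sort of first guess I would start from on a 4 GB box
is below - every figure is only a starting point to measure and tune, not a
recommendation:

    # postgresql.conf - rough first guesses for a dedicated 4 GB machine
    shared_buffers = 512MB          # often quoted as roughly 1/8 to 1/4 of RAM
    work_mem = 32MB                 # per sort/hash, per backend (called sort_mem before 8.0)
    maintenance_work_mem = 256MB    # helps index builds and VACUUM after bulk loads
    effective_cache_size = 2GB      # estimate of what the OS filesystem cache will hold
    checkpoint_segments = 16        # fewer, larger checkpoints during big loads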

5) Other considerations?

Thanks in advance,

Dino

