
Re: Performance while loading data and indexing

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Shridhar Daithankar <shridhar_daithankar(at)persistent(dot)co(dot)in>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-general(at)postgresql(dot)org
Subject: Re: Performance while loading data and indexing
Date: 2002-09-26 09:05:19
Message-ID: 20020926090519.GB10471@svana.org
Lists: pgsql-general pgsql-hackers pgsql-performance
I'll preface this by saying that while I have a large database, it doesn't
require quite the performance you're talking about here.

On Thu, Sep 26, 2002 at 02:05:44PM +0530, Shridhar Daithankar wrote:
> 1) Database load time from flat file using copy is very high
> 2) Creating index takes huge amount of time.
> 3) Any suggsestions for runtime as data load and query will be going in 
> parallel.

You're loading all the data in one COPY. I find that INSERTs are mostly
limited by indexes. While index lookups are cheap, they are not free, and
each index needs to be updated for each row.
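As a sketch of what I mean (table and index names are made up), a bulk COPY
usually goes much faster with the indexes dropped first and rebuilt afterwards:

```sql
-- Hypothetical names; the idea is: drop indexes, COPY, rebuild.
DROP INDEX rawdata_code_ts_idx;

COPY rawdata FROM '/data/load/dump.dat';

-- One rebuild sorts the whole table once, which is much cheaper than
-- 125M incremental index updates made during the load.
CREATE UNIQUE INDEX rawdata_code_ts_idx
    ON rawdata (code1, code2, stamp);
```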

I found that using partial indexes to index only the rows you actually use
can help with loading. It's a bit obscure, though.
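For instance (hypothetical table and predicate), a partial index covers only
the rows matching its WHERE clause, so rows outside the predicate insert
without touching it at all:

```sql
-- Only recent rows are indexed; older rows cost nothing to insert as
-- far as this index is concerned.  Note the planner will only use it
-- when a query's WHERE clause implies the index predicate.
CREATE INDEX rawdata_recent_idx ON rawdata (stamp)
    WHERE stamp > '2002-09-25';
```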

As for parallel loading, you'll be limited mostly by your I/O bandwidth.
Have you measured it to make sure it's up to speed?

> Now the details. Note that this is a test run only..
> 
> Platform:- 4x Xeon2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
> RedHat7.2/PostgreSQL7.1.3
> 
> Database in flat file: 
> 125,000,000 records of around 100 bytes each. 
> Flat file size 12GB
> 
> Load time: 14581 sec/~8600 rows persec/~ an MB of data per sec.
> Create unique composite index on 2 char and a timestamp field:  25226 sec.
> Database size on disk: 26GB
> Select query: 1.5 sec. for approx. 150 rows.

So you're loading at a rate of 860KB per sec. That's not too fast. How many
indexes are active at that time? Triggers and foreign keys also take their
toll.

> Important postgresql.conf settings
> 
> sort_mem = 12000
> shared_buffers = 24000
> fsync=true (Sad but true. Left untouched.. Will that make a difference on 
> SCSI?)
> wal_buffers = 65536 
> wal_files = 64 

fsync IIRC only affects the WAL now, but it can be quite expensive,
especially considering it runs on every transaction commit. Oh, are your
WAL files on a separate disk from the data?

> Initial flat data load: 250GB of data. This has gone up since last query. It 
> was 150GB earlier..
> Ongoing inserts: 5000/sec. 
> Number of queries: 4800 queries/hour
> Query response time: 10 sec.

That looks quite achievable.

> 1)  Instead of copying from a single 12GB data file,  will a parallel copy from 
> say 5 files will speed up the things? 

Limited by I/O bandwidth. On Linux, vmstat can tell you how many blocks are
being read and written per second. Try it. As long as sync() doesn't get
done too often, it should help.
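A sketch of what I have in mind (file paths invented): split the dump and run
one COPY per connection, say five psql sessions loading the same table
concurrently:

```sql
-- Session 1:
COPY rawdata FROM '/data/load/part1.dat';
-- Session 2:
COPY rawdata FROM '/data/load/part2.dat';
-- ...and so on up to part5 in the remaining sessions.  Aggregate
-- throughput is then bounded by disk bandwidth rather than by a
-- single backend.
```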

> Couple MB of data per sec. to disk is just not saturating it. It's a RAID 5 
> setup..

No, it's not. You should be able to do better.

> 2) Sort mem.=12K i.e. 94MB, sounds good enough to me. Does this need further 
> addition to improve create index performance?

Should be fine. Admittedly, your indexes are taking rather long to build.
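One thing worth trying (the number is a guess at what your box can spare):
raise sort_mem for just the session doing the build. CREATE INDEX does one
big sort, and more sort memory means fewer on-disk merge passes:

```sql
-- sort_mem is in kilobytes and can be set per session.
SET sort_mem = 131072;   -- ~128MB for this backend only

CREATE UNIQUE INDEX rawdata_code_ts_idx
    ON rawdata (code1, code2, stamp);

RESET sort_mem;          -- back to the postgresql.conf value
```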

> 3) 5K concurrent inserts with an index on, will this need a additional CPU 
> power? Like deploying it on dual RISC CPUs etc? 

It shouldn't. Do you have an idea of what your CPU usage is? ps aux should
give you a decent idea.

> 4) Query performance is not a problem. Though 4.8K queries per sec. expected 
> response time from each query is 10 sec. But my guess is some serius CPU power 
> will be chewed there too..

Should be fine.

> 5)Will upgrading to 7.2.2/7.3 beta help?

Possibly, though it may be worth it just for the features/bugfixes.

> All in all, in the  test, we didn't see the performance where hardware is 
> saturated to it's limits. So effectively we are not able to get postgresql 
> making use of it. Just pushing WAL and shared buffers does not seem to be the 
> solution.
> 
> If you guys have any suggestions. let me know.  I need them all..

Find the bottleneck: CPU, I/O or memory?

> Mysql is almost out because it's creating index for last 17 hours. I don't 
> think it will keep up with 5K inserts per sec. with index. SAP DB is under 
> evaluation too. But postgresql is most favourite as of now because it works. So 
> I need to come up with solutions to problems that will occur in near future..
> ;-)

17 hours! Ouch. Either way, you should be able to do much better. Hope this
helps,
-- 
Martijn van Oosterhout   <kleptog(at)svana(dot)org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

pgsql-performance by date

Next: From: Shridhar Daithankar, Date: 2002-09-26 09:13:20
Subject: Re: Performance while loading data and indexing
Previous: From: Shridhar Daithankar, Date: 2002-09-26 08:54:02
Subject: Re: [HACKERS] Performance while loading data and indexing
