Re: Performance while loading data and indexing

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Shridhar Daithankar <shridhar_daithankar(at)persistent(dot)co(dot)in>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-general(at)postgresql(dot)org
Subject: Re: Performance while loading data and indexing
Date: 2002-09-26 09:05:19
Message-ID: 20020926090519.GB10471@svana.org
Lists: pgsql-general pgsql-hackers pgsql-performance

I'll preface this by saying that while I have a large database, it doesn't
require quite the performance you're talking about here.

On Thu, Sep 26, 2002 at 02:05:44PM +0530, Shridhar Daithankar wrote:
> 1) Database load time from flat file using copy is very high
> 2) Creating index takes huge amount of time.
> 3) Any suggestions for runtime as data load and query will be going in
> parallel.

You're loading all the data in one copy. I find that INSERTs are mostly
limited by indexes. While index lookups are cheap, they are not free and
each index needs to be updated for each row.

I found that using partial indexes to index only the rows you actually use
can help with the loading. It's a bit obscure, though.
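As a sketch (the table and column names here are made up for illustration;
partial indexes need PostgreSQL 7.2 or later):

```sql
-- Index only the rows the queries actually touch, e.g. recent records.
-- "events" and "created" are hypothetical names.
CREATE INDEX idx_events_recent
    ON events (created)
    WHERE created > '2002-09-01';

-- A query whose WHERE clause implies the index predicate can use
-- the much smaller partial index:
SELECT * FROM events
WHERE created > '2002-09-20';
```

Rows that fall outside the predicate never touch the index at all, so
inserting them gets correspondingly cheaper.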

As for parallel loading, you'll be limited mostly by your I/O bandwidth.
Have you measured it to make sure it's up to speed?

> Now the details. Note that this is a test run only..
>
> Platform:- 4x Xeon2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
> RedHat7.2/PostgreSQL7.1.3
>
> Database in flat file:
> 125,000,000 records of around 100 bytes each.
> Flat file size 12GB
>
> Load time: 14581 sec/~8600 rows persec/~ an MB of data per sec.
> Create unique composite index on 2 char and a timestamp field: 25226 sec.
> Database size on disk: 26GB
> Select query: 1.5 sec. for approx. 150 rows.

So you're loading at a rate of 860KB per sec. That's not too fast. How many
indexes are active at that time? Triggers and foreign keys also take their
toll.
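A common pattern for bulk loads is to drop secondary indexes first and
rebuild them once the COPY finishes; building an index in one sort is far
cheaper than maintaining it row by row. Sketch only, with hypothetical
names and paths:

```sql
-- Drop secondary indexes before the bulk load...
DROP INDEX idx_events_key;

-- ...load the flat file in one pass (path is illustrative)...
COPY events FROM '/data/events.dat';

-- ...then rebuild the index in a single sort afterwards.
CREATE UNIQUE INDEX idx_events_key
    ON events (code1, code2, created);
```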

> Important postgresql.conf settings
>
> sort_mem = 12000
> shared_buffers = 24000
> fsync=true (Sad but true. Left untouched.. Will that make a difference on
> SCSI?)
> wal_buffers = 65536
> wal_files = 64

fsync IIRC only affects the WAL buffers now but it may be quite expensive,
especially considering it's running on every transaction commit. Oh, are your
WAL files on a separate disk from the data?

> Initial flat data load: 250GB of data. This has gone up since last query. It
> was 150GB earlier..
> Ongoing inserts: 5000/sec.
> Number of queries: 4800 queries/hour
> Query response time: 10 sec.

That looks quite achievable.

> 1) Instead of copying from a single 12GB data file, will a parallel copy from
> say 5 files will speed up the things?

Limited by I/O bandwidth. On Linux, vmstat can tell you how many blocks are
being loaded and stored per second. Try it. As long as sync() doesn't get
done too often, it should help.

> Couple MB of data per sec. to disk is just not saturating it. It's a RAID 5
> setup..

No, it's not. You should be able to do better.

> 2) Sort mem.=12K i.e. 94MB, sounds good enough to me. Does this need further
> addition to improve create index performance?

Should be fine, although admittedly your indexes are taking rather long to
build.
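sort_mem can also be raised for just the session that builds the index,
rather than globally. A sketch (value in KB, chosen for illustration; index
and table names are hypothetical):

```sql
-- Give this session a bigger sort buffer for the index build only;
-- other backends keep the postgresql.conf default.
SET sort_mem = 131072;   -- roughly 128MB, illustrative value

CREATE UNIQUE INDEX idx_events_key
    ON events (code1, code2, created);
```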

> 3) 5K concurrent inserts with an index on, will this need a additional CPU
> power? Like deploying it on dual RISC CPUs etc?

It shouldn't. Do you have an idea of what your CPU usage is? ps aux should
give you a decent idea.

> 4) Query performance is not a problem. Though 4.8K queries per sec. expected
> response time from each query is 10 sec. But my guess is some serius CPU power
> will be chewed there too..

Should be fine.

> 5)Will upgrading to 7.2.2/7.3 beta help?

Possibly, though it may be worth it just for the features/bugfixes.

> All in all, in the test, we didn't see the performance where hardware is
> saturated to it's limits. So effectively we are not able to get postgresql
> making use of it. Just pushing WAL and shared buffers does not seem to be the
> solution.
>
> If you guys have any suggestions. let me know. I need them all..

Find the bottleneck: CPU, I/O or memory?

> Mysql is almost out because it's creating index for last 17 hours. I don't
> think it will keep up with 5K inserts per sec. with index. SAP DB is under
> evaluation too. But postgresql is most favourite as of now because it works. So
> I need to come up with solutions to problems that will occur in near future..
> ;-)

17 hours! Ouch. Either way, you should be able to do much better. Hope this
helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
