Re: MVCC for massively parallel inserts

From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: MVCC for massively parallel inserts
Date: 2004-01-06 04:46:18
Message-ID: 87smit1y79.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Alex Satrapa <alex(at)lintelsys(dot)com(dot)au> writes:

> Properly set up in a RAID-1/0 array, you'll get much better "bandwidth" out of
> those drives. Whether you RAID in software or hardware is up to you and your
> budget - but if you choose hardware, make sure you actually get a hardware RAID
> controller, not one of the cheapies which are just
> multiple-IDE-controllers-with-special-drivers.
>
> We use a 3Ware 7500 (can't remember the model name for the life of me), with 8
> drives in a 4x2 RAID 1/0.

I would agree and if you really need the I/O bandwidth you can go to much
larger stripe sets than even this. The documentation I've seen before
suggested there were benefits up to stripe sets as large as twelve disks
across. That would be 24 drives if you're also doing mirroring.

Ideally separating WAL, index, and heap files is good, but you would have to
experiment to see which works out fastest for a given number of drives.

There are also some alternative approaches that could increase your
throughput. For example, you could have your multiple machines receiving the
data log the data to text files. Then you could copy the text files over to
the database periodically and load the with COPY which is faster than a
database insert.

Also, if it fits your model you could load the data into fresh unindexed
tables and then build a new index. Building a new index is a quicker operation
than handling individual inserts. That would make selects more complex though,
but you perhaps that's not a concern.

> IIRC, if the inserts are done in a transaction, the indexing gets done at the
> end of the batch rather than after each insert.

I believe this is wrong. The whole point of postgres's style of MVCC is that
each transaction can go ahead and do whatever modifications it needs and mark
it with its transaction id, any other transaction simply ignores the data
marked with transaction ids of uncommitted transactions. When commit time
arrives there's very little work to do to do the commit beyond simply marking
the transaction as committed.

--
greg

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2004-01-06 05:04:11 Re: psql \d option list overloaded
Previous Message Alex Satrapa 2004-01-06 04:01:36 Re: MVCC for massively parallel inserts