Re: Bunching "transactions"

From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Bunching "transactions"
Date: 2007-10-25 20:46:46
Message-ID: 60lk9q9b2h.fsf@dba2.int.libertyrms.com
Lists: pgsql-performance

jeandavid8(at)verizon(dot)net (Jean-David Beyer) writes:
> Chris Browne wrote:
>> jeandavid8(at)verizon(dot)net (Jean-David Beyer) writes:
>>> But what is the limitation on such a thing? In this case, I am just
>>> populating the database and there are no other users at such a time. I am
>>> willing to lose the whole insert of a file if something goes wrong -- I
>>> would fix whatever went wrong and start over anyway.
>>>
>>> But at some point, disk IO would have to be done. Is this just a function of
>>> how big /pgsql/data/postgresql.conf's shared_buffers is set to? Or does it
>>> have to do with wal_buffers and checkpoint_segments?
>>
>> I have done bulk data loads where I was typically loading hundreds of
>> thousands of rows in as a single transaction, and it is worth
>> observing that loading in data from a pg_dump will do exactly the same
>> thing, where, in general, each table's data is loaded as a single
>> transaction.
>
> I guess a reasonable standard of performance would be that if my initial
> population of the database takes only a little longer than a restore of the
> database using pg_restore, I am pretty close, and that is good enough. Of
> course, the restore depends on how fast my tape drive can pull the tape --
> it claims up to 12 MB/sec transfer rate, so it looks as though it will be
> tape-limited rather than PostgreSQL-limited.

That's quite possible.

There is a further factor, which is that grouping things into larger
transactions has very clearly diminishing returns.

Suppose you have a stream of 50,000 operations, each updating one
tuple (whether they are UPDATEs, DELETEs, or INSERTs is, to first
order, immaterial). The effects of grouping are then as follows...

- With none...

Cost = cost of doing 50,000 updates
+ cost of doing 50,000 COMMITs

- If you COMMIT after every 2 updates

Cost = cost of doing 50,000 updates
+ cost of doing 25,000 COMMITs

- If you COMMIT after every 10 updates

Cost = cost of doing 50,000 updates
+ cost of doing 5,000 COMMITs

- If you COMMIT after every 100 updates

Cost = cost of doing 50,000 updates
+ cost of doing 500 COMMITs

The amount of work that a COMMIT does is pretty much constant,
regardless of the number of updates in the transaction, so the cost of
the COMMITs in that equation quickly shrinks to irrelevance.

And increasing the sizes of the transactions does not give you
*increasing* performance improvements; the improvements will tend to
decline.
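
To illustrate the shape of that curve, here is a rough
back-of-the-envelope model in Python. The relative costs are invented
purely for illustration (I'm assuming a COMMIT costs about as much as
five single-tuple updates); only the trend matters, not the numbers.

    # Hypothetical relative costs; only the shape of the curve is the point.
    N_OPS = 50_000      # total single-tuple operations
    C_UPDATE = 1.0      # assumed relative cost of one update
    C_COMMIT = 5.0      # assumed relative cost of one COMMIT

    for batch in (1, 2, 10, 100, 1000, 10000):
        commits = N_OPS // batch
        total = N_OPS * C_UPDATE + commits * C_COMMIT
        print(f"batch={batch:>6}  commits={commits:>6}  relative cost={total:>9.0f}")

Past a batch size of a few hundred, the COMMIT term is already lost in
the noise, which is why ever-larger batches stop paying off.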

I wouldn't worry about trying to strictly minimize the number of
transactions COMMITted; once you have grouped "enough" data into one
transaction, that should be good enough.
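
For concreteness, a minimal sketch of that "commit every N rows"
pattern using Python and psycopg2. The connection string, table name,
and rows() generator are placeholders standing in for whatever your
load job actually reads; a batch of a few thousand rows is usually
"enough" in the sense above.

    import psycopg2

    BATCH_SIZE = 5_000

    def rows():
        # Placeholder: yield tuples parsed from your input file.
        for i in range(50_000):
            yield (i, f"value-{i}")

    conn = psycopg2.connect("dbname=mydb")   # assumed connection string
    cur = conn.cursor()

    pending = 0
    for r in rows():
        cur.execute("INSERT INTO my_table (id, val) VALUES (%s, %s)", r)
        pending += 1
        if pending >= BATCH_SIZE:
            conn.commit()                    # one COMMIT per BATCH_SIZE inserts
            pending = 0

    conn.commit()                            # flush the final partial batch
    cur.close()
    conn.close()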

Further, the Right Thing is to group related data together, and come
up with a policy that is driven primarily by the need for data
consistency. If things work well enough, then don't go off trying to
optimize something that doesn't really need optimization, and perhaps
break the logic of the application.
--
output = ("cbbrowne" "@" "acm.org")
http://cbbrowne.com/info/unix.html
Users should cultivate an ability to make the simplest molehill into a
mountain by finding controversial interpretations of innocuous
sounding statements that the sender never intended or imagined.
-- from the Symbolics Guidelines for Sending Mail
