On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> > "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> >> Exactly. Which is the point I am making. Five minutes of transactions
> >> is nothing (speaking generally).. In short, if we are in recovery, and
> >> we are not saturated the I/O and at least a single CPU, there is a huge
> >> amount of optimization *somewhere* to be done.
> > You sure about that? I tested CVS HEAD just now, by setting the
> > checkpoint_ parameters really high, running pgbench for awhile, and
> > then killing the bgwriter to force a recovery cycle over all the WAL
> > generated by the pgbench run. What I saw was that the machine was 100%
> > disk write bound. Increasing shared_buffers helped, not in that the
> > write rate got less according to vmstat, but the completion time did.
> There are at least three definitions of "saturating the I/O" and it sounds
> like you two are using two different ones.
> 1) The processor is waiting on I/O all the time
> 2) The hard drives are all always handling a request
> 3) The hard drives are all handling the full bandwidth they're capable
> You would expect (1) and (2) to be the same for a single drive -- though in
> practice there seems to be a gap even between them. But for a raid array there
> can be a large difference, and the wider the raid stripe the larger the
> In Tom's results:
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0
> > 0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0
> > 0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0
> > I don't see the machine sitting around doing nothing ...
> Note that even though the processor is 99% in wait state the drive is only
> handling about 3 MB/s. That translates into a seek time of 2.2ms which is
> actually pretty fast. So if this is a single drive (1) and (2) seem to be
> pretty much the same here.
> But note that if this were a raid array Postgres's wouldn't be getting any
> better results. A Raid array wouldn't improve i/o latency at all and since
> it's already 99% waiting for i/o Postgres is not going to be able to issue any
> more. But only one drive in the raid array will be busy at a time which would
> be far less than the maximum random access i/o the raid array is capable of.
Agree with Greg's analysis here. Moving to -hackers now.
I've done performance profiling also. My results replicated Tom's, but I
hadn't performed them on a big enough system and so didn't realise the
I/O scalability issue could be such a large problem. Koichi showed me
some results on a much larger server that illustrated the I/O problem.
But lets remember its only a problem on large servers with a heavy write
workload and a large random I/O requirement. That's an important set of
people, but much less than Josh's 10% of people even.
> Heikki proposed a while back to use posix_fadvise() when processing logs to
> read-ahead blocks which the recover will need before actually attempting to
> recover them. On a raid array that would bring the 3MB/s above up to the
> maximum number of random accesses the raid array can handle (ie, definition
> (2) above).
It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.
Florian's code for Hot Standby introduces a separate recovery process,
similar to an autovacuum launcher. I propose a mechanism similar to the
AV solution where we have lots of recovery workers, with one recovery
master reading the WAL files. We can then distribute WAL records to
workers in some manner.
It's true that many WAL records depend upon each other, but its also
true that the performance problems only occur in the situation when they
the WAL records don't depend upon each other. If they did, they would
touch the same blocks and it would be cached. So as long as we have a
safe mechanism for splitting up the work, everything is fine.
We can divide up the WAL records this by looking at the rmgr field, plus
looking deeper into the records themselves so we can touch different
So I'm planning to review this *after* Florian has introduced his patch,
so we can build upon it.
In response to
pgsql-hackers by date
|Next:||From: Gregory Stark||Date: 2007-12-13 09:45:47|
|Subject: Re: Slow PITR restore|
|Previous:||From: Joshua D. Drake||Date: 2007-12-13 07:08:35|
|Subject: Re: Slow PITR restore|
pgsql-general by date
|Next:||From: D. Dante Lorenso||Date: 2007-12-13 08:38:15|
|Subject: Re: Need LIMIT and ORDER BY for UPDATE|
|Previous:||From: D. Dante Lorenso||Date: 2007-12-13 08:24:45|
|Subject: Re: For the SQL gurus out there|