
Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-01-31 21:33:18
Message-ID: 4D472A9E.2090901@2ndquadrant.com
Lists: pgsql-hackers
Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> 3. Pause for 3 seconds after every fsync.
>> I think something along the lines of #3 is probably a good idea,
>
> Really?  Any particular delay is guaranteed wrong.

'3 seconds' is just a placeholder for whatever comes out of a "total 
time scheduled to sync / relations to sync" computation.  (I'm still 
doing all my thinking in terms of time, although I recognize a showdown 
with segment-based checkpoints is coming too.)
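
For concreteness, here's a minimal standalone sketch of that 
computation; every name in it is illustrative, nothing from an actual 
patch:

#include <stdio.h>

/*
 * Spread the sync-phase time budget evenly across the relations that
 * need an fsync; the hard-coded "3 seconds" above is just what this
 * returns for one particular workload.
 */
static int
sync_delay_ms(int sync_budget_ms, int relations_to_sync)
{
    if (relations_to_sync <= 0)
        return 0;
    return sync_budget_ms / relations_to_sync;
}

int
main(void)
{
    /* 30s of scheduled sync time across 10 relations -> 3s pauses */
    printf("%d ms\n", sync_delay_ms(30000, 10));
    return 0;
}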

I think the right way to compute "relations to sync" is to finish the 
sorted-writes patch (I already sent over an update of it that isn't 
quite right yet), which is my next thing to work on here.  I remain 
pessimistic that any attempt to issue fsync calls without the maximum 
possible delay, after asking the kernel to write things out first, will 
work out well.  My recent tests with low values of dirty_bytes on Linux 
just reinforce how bad that can turn out.  In addition to computing the 
relation count while sorting, placing writes in order by relation and 
then doing all writes followed by all syncs should place the database 
right in the middle of the throughput/latency trade-off here.  Each 
relation will have had the maximum amount of time we can give it to 
sort and flush its writes before it is asked to sync them.  I don't 
want to try to be any smarter than that without trying to be a *lot* 
smarter: timing individual sync calls, feedback loops on time 
estimation, etc.
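
To make that ordering concrete, a sketch under assumed types; the real 
work happens against the buffer pool, not a flat array like this:

#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-in for a dirty buffer queued for checkpoint. */
typedef struct PendingWrite
{
    uint32_t    relid;      /* relation the dirty block belongs to */
    uint32_t    blkno;      /* block number within that relation */
} PendingWrite;

static int
pending_write_cmp(const void *a, const void *b)
{
    const PendingWrite *pa = a;
    const PendingWrite *pb = b;

    if (pa->relid != pb->relid)
        return (pa->relid < pb->relid) ? -1 : 1;
    if (pa->blkno != pb->blkno)
        return (pa->blkno < pb->blkno) ? -1 : 1;
    return 0;
}

/*
 * Sorting groups each relation's writes together, so every relation
 * gets the longest possible interval between its last write and its
 * fsync; counting the runs gives "relations to sync" as a side effect.
 */
static int
sort_and_count_relations(PendingWrite *writes, size_t n)
{
    int         relations = (n > 0) ? 1 : 0;
    size_t      i;

    qsort(writes, n, sizeof(PendingWrite), pending_write_cmp);
    for (i = 1; i < n; i++)
        if (writes[i].relid != writes[i - 1].relid)
            relations++;
    return relations;
}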

At this point I have to agree with Robert's observation that splitting 
checkpoints into checkpoint_write_target and checkpoint_sync_target is 
the only reasonable thing left that might be possible to complete in a 
short period.  So that's where the "total time scheduled to sync" 
numerator can come from here.
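
If that split lands, the numerator would presumably be computed along 
these lines; checkpoint_sync_target is the name proposed in this 
thread, not an existing GUC:

/*
 * Sync-phase time budget: the fraction of the checkpoint interval
 * reserved for spreading out the fsync calls.  Parameter names follow
 * the proposal under discussion; neither GUC exists yet.
 */
static int
sync_budget_ms(int checkpoint_timeout_ms, double checkpoint_sync_target)
{
    return (int) (checkpoint_timeout_ms * checkpoint_sync_target);
}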

The main thing I will warn about in relation to today's discussion is 
the danger of true deadline-oriented scheduling in this area.  The 
checkpoint process may discover the sync phase is falling behind 
expectations because the individual sync calls are taking longer than 
expected.  If that happens, aiming for the "finish on target anyway" 
goal puts you right back to a guaranteed nasty write spike again.  I 
think many people would prefer logging the overrun as tuning feedback 
for the DBA rather than accelerating, which is likely to make the 
problem even worse if the checkpoint is falling behind.  But since 
ultimately the feedback for this will be "make the checkpoints longer 
or increase checkpoint_sync_target", sync acceleration to meet the 
deadline isn't unacceptable either; the DBA can try both of those 
remedies if spikes show up.
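
In code terms, the log-rather-than-accelerate preference looks 
something like this sketch (plain fprintf standing in for the server's 
logging here):

#include <stdio.h>

/*
 * "Log, don't accelerate": when the sync phase overruns its budget,
 * report it as tuning feedback rather than compressing the remaining
 * fsync calls into a spike.
 */
static void
report_sync_overrun(int elapsed_ms, int budget_ms)
{
    if (elapsed_ms > budget_ms)
        fprintf(stderr,
                "checkpoint sync phase overran its target by %d ms; "
                "consider longer checkpoints or a larger sync target\n",
                elapsed_ms - budget_ms);
}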

-- 
Greg Smith   2ndQuadrant US    greg(at)2ndQuadrant(dot)com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
