Checkpoint distribution

From: Borodin Vladimir <root(at)simply(dot)name>
To: pgsql-performance(at)postgresql(dot)org
Subject: Checkpoint distribution
Date: 2014-04-14 09:46:42
Message-ID: 4AF2BEEF-5BB5-4155-BF70-D3E877914061@simply.name
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all.

I’m running PostgreSQL 9.3.4 and doing stress test of the database with writing only load. The test plan does 1000 transactions per second (each of them does several updates/inserts). The problem is that checkpoint is not distributed over time well. When the checkpoint finishes, the db gets lots of I/O operations and response timings grows strongly.

My checkpoint settings looks like that:

postgres=# select name, setting from pg_catalog.pg_settings where name like 'checkpoint%' and boot_val != reset_val;
name | setting
------------------------------+---------
checkpoint_completion_target | 0.9
checkpoint_segments | 100500
checkpoint_timeout | 600
(3 rows)

postgres=#

But in the log I see that checkpoint continues less than 600*0.9 = 540 seconds:

2014-04-14 12:54:41.479 MSK,,,10517,,53468da6.2915,433,,2014-04-10 16:25:10 MSK,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2014-04-14 12:57:06.107 MSK,,,10517,,53468da6.2915,434,,2014-04-10 16:25:10 MSK,,0,LOG,00000,"checkpoint complete: wrote 65140 buffers (24.8%); 0 transaction log file(s) added, 0 removed, 327 recycled; write=134.217 s, sync=10.292 s, total=144.627 s; sync files=31, longest=3.332 s, average=0.331 s",,,,,,,,,»"

When the checkpoint starts (12:54:41.479) dstat says that I/O load increases:

----system---- -dsk/total- --io/total-
date/time | read writ| read writ
14-04 12:54:39| 0 15M| 0 2562
14-04 12:54:40| 0 13M| 0 2330
14-04 12:54:41| 0 97M| 0 5981
14-04 12:54:42| 0 95M| 0 8869
14-04 12:54:43| 0 147M| 0 8493
14-04 12:54:44| 0 144M| 0 8316
14-04 12:54:45| 0 176M| 0 8189
14-04 12:54:46| 0 141M| 0 8221
14-04 12:54:47| 0 143M| 0 8260
14-04 12:54:48| 0 141M| 0 7576
14-04 12:54:49| 0 173M| 0 8171

But when it finishes (12:57:06.107) the I/O load is much higher than the hardware can do:

----system---- -dsk/total- --io/total-
date/time | read writ| read writ
14-04 12:56:52| 0 33M| 0 5185
14-04 12:56:53| 0 64M| 0 5271
14-04 12:56:54| 0 65M| 0 5256
14-04 12:56:55| 0 153M| 0 15.8k
14-04 12:56:56| 0 758M| 0 18.6k
14-04 12:56:57| 0 823M| 0 4164
14-04 12:56:58| 0 843M| 0 8186
14-04 12:56:59| 0 794M| 0 15.0k
14-04 12:57:00| 0 880M| 0 5954
14-04 12:57:01| 0 862M| 0 4608
14-04 12:57:02| 0 804M| 0 7387
14-04 12:57:03| 0 849M| 0 4878
14-04 12:57:04| 0 788M| 0 20.0k
14-04 12:57:05| 0 805M| 0 6004
14-04 12:57:06| 0 143M| 0 6932
14-04 12:57:07| 0 108M| 0 6150
14-04 12:57:08| 0 42M| 0 6233
14-04 12:57:09| 0 73M| 0 6248

Response timings of the application at this moment look like that:

The hardware is quite good to handle this load (PGDATA lives on soft raid10 array of 8 ssd drives). I’ve done the same test with 3000 tps - the result was exactly the same. The only difference was that I/O spikes had been stronger.

So my question is why the checkpoint is not spread for 540 seconds? Is there any way to understand why I/O spike happens when the checkpoint finishes but does not happen during all of the checkpoint process? Any help would be really appropriate.

--
Vladimir

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Ryan Johnson 2014-04-14 12:58:14 SSI slows down over time
Previous Message Jeff Janes 2014-04-14 00:02:04 Re: batch update query performance