Re: postgresql latency & bgwriter not doing its job

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 09:58:00
Message-ID: 20140826095800.GK21544@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-26 11:34:36 +0200, Fabien COELHO wrote:
>
> >Uh. I'm not surprised you're facing utterly horrible performance with
> >this. Did you try using a *large* checkpoint_segments setting? To
> >achieve high performance
>
> I do not seek "high performance" per se, I seek "lower maximum latency".

So?

> I think that the current settings and parameters are designed for high
> throughput, but do not allow one to control latency even with a small load.

The way you're setting them is tuned for 'basically no write
activity'.

> >you likely will have to make checkpoint_timeout *longer* and increase
> >checkpoint_segments until *all* checkpoints are started because of "time".
>
> Well, as I want to test a *small* load in a *reasonable* time, I did not
> enlarge the number of segments, otherwise it would take ages.

Well, that way you're testing something basically meaningless. That's
not helpful either.
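
To be concrete, something like this is what I mean (illustrative values
only, you'd have to adapt them to your hardware):

  # postgresql.conf - illustrative, not tuned for your machine
  checkpoint_segments = 100           # high enough no checkpoint starts because of xlog
  checkpoint_timeout = 30min          # all checkpoints start because of "time"
  checkpoint_completion_target = 0.9  # pace the writes over ~90% of the interval
  log_checkpoints = on                # verify "checkpoint starting:" always says time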

> If I put a "checkpoint_timeout = 1min" and "checkpoint_completion_target =
> 0.9" so that the checkpoints are triggered by the timeout,
>
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
> file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
> total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
> ...
>
> The result is basically the same (well, 18% of transactions lost, but the
> results do not seem stable from one run to the next), only there are more
> checkpoints.

With these settings you're fsyncing the entire data directory once a
minute, nearly entirely from the OS's buffer cache, because the OS's
writeback logic didn't have time to kick in.
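
On a stock Linux kernel the relevant writeback knobs default to
something like this (exact values vary by distro/kernel version):

  vm.dirty_writeback_centisecs = 500  # flusher threads wake up every 5s
  vm.dirty_expire_centisecs = 3000    # dirty pages written back once ~30s old

With a one-minute checkpoint interval most pages never get old enough to
be written back before the fsync()s arrive.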

> I fail to understand how multiplying both the segments and time would solve
> the latency problem. If I set 30 segments then it takes 20 minutes to fill
> them, and if I set the timeout to 15min then I'll have to wait 15 minutes to
> test.

a) The kernel's writeback logic only kicks in with a delay. b) The amount
of writes you're doing with short checkpoint intervals is overall
significantly higher than with longer intervals. That obviously has an
impact on latency as well as throughput. c) The time it takes for
segments to be filled is mostly irrelevant. The phase that's very likely
causing trouble for you is the fsyncs issued at the end of a
checkpoint.
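
On b), some back-of-the-envelope math from your own log line above:

  4476 buffers = 27.3%  =>  shared_buffers ~ 16384 buffers = 128MB (the default)
  4476 * 8kB ~ 35MB per 1min checkpoint ~ 2.1GB/hour
  with hourly checkpoints a constantly re-dirtied buffer is written
  once per hour instead of sixty times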

> >There's three reasons:
> >a) if checkpoint_timeout + completion_target is large and the checkpoint
> >isn't executed prematurely, most of the dirty data has been written out
> >by the kernel's background flush processes.
>
> Why would they be written by the kernel if bgwriter has not sent them??

I think you're misunderstanding how spread checkpoints work. When the
checkpointer process starts a spread checkpoint, it first writes all
buffers to the kernel in a paced manner. That pace is determined by
checkpoint_completion_target and checkpoint_timeout. Once all buffers
that are old enough to need to be checkpointed have been written out, the
checkpointer fsync()s all the on-disk files. That part is *NOT*
paced. Then it can go on to remove old WAL files.
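
Schematically, and glossing over lots of detail:

  # one spread checkpoint, roughly:
  for each buffer dirty at checkpoint start:
      write it out to the kernel, sleeping in between so the write phase
      spans ~ checkpoint_timeout * checkpoint_completion_target  # paced
  for each file with buffers written since the last checkpoint:
      fsync(file)                                                # NOT paced
  remove/recycle WAL segments that are no longer needed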

The latency problem is almost certainly caused by the fsync()s
mentioned above. When they're executed, the kernel starts flushing out a
lot of dirty buffers at once, creating very deep IO queues, which makes
synchronous additions to that queue (WAL flushes, reads) very slow to
process.
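
You can watch that happen while the benchmark runs, e.g. with

  $ iostat -x 1
  $ grep -E '^(Dirty|Writeback):' /proc/meminfo

avgqu-sz/await on the data device and the Dirty counter will spike
exactly while the sync phase of a checkpoint runs - that's where your
latency outliers come from.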

> >c) If checkpoints are infrequent enough, the penalty of them causing
> >problems, especially if not using ext4, plays less of a role overall.
>
> I think that what you suggest would only delay the issue, not solve it.

The amount of dirty data that needs to be flushed is essentially
bounded. If you get stalls of roughly the same magnitude (say, within a
factor of two of each other), the smaller one once a minute and the
larger one once an hour, the once-an-hour variant will obviously give
better latency to many, many more transactions.
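
Rough numbers, using the sync=5.127s from your log:

  ~5s stall once every 60s     =>  ~8.5% of wall-clock time inside a stall
  ~10s stall once every 3600s  =>  ~0.3%, even at twice the stall length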

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
