
From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-11-12 16:44:40
Message-ID: alpine.DEB.2.10.1511121722140.15029@sto
Lists: pgsql-hackers


>> To fix it, ISTM that it is enough to hold a "do not close" lock on the file
>> while a flush is in progress (a short time), which would prevent mdclose from
>> doing its stuff.
>
> Could you expand a bit more on this? You're suggesting something like a
> boolean in the vfd struct?

Basically yes, I'm suggesting a mutex in the vfd struct.

> If that, how would you deal with FileClose() being called?

Just wait for the mutex. It would be held while flushes are accumulated
into the flush context and released once the flush has been performed and
the fd is no longer needed for that purpose, which is expected to be a
short time (at worst between the wake & sleep of the checkpointer, and
only one file at a time).
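
To make the intent a bit more concrete, here is a minimal standalone
sketch of the idea, not the actual fd.c structures: SketchVfd,
StartFlushOnFile, EndFlushOnFile and CanCloseFile are invented names for
illustration only. The flush path sets a "do not close" marker on the
entry, and the close/LRU path checks it.

/*
 * Standalone sketch of the "do not close while flushing" idea.
 * Struct and function names are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct SketchVfd
{
    int   fd;                 /* kernel file descriptor, -1 if closed */
    bool  flush_in_progress;  /* "do not close" marker held while the
                               * flush context still references this fd */
} SketchVfd;

/* Set the marker before accumulating writes into the flush context. */
static void
StartFlushOnFile(SketchVfd *vfd)
{
    vfd->flush_in_progress = true;
}

/* Clear it once the flush has actually been issued. */
static void
EndFlushOnFile(SketchVfd *vfd)
{
    vfd->flush_in_progress = false;
}

/* A close/LRU-eviction path would skip (or wait on) marked files. */
static bool
CanCloseFile(const SketchVfd *vfd)
{
    return !vfd->flush_in_progress;
}

int
main(void)
{
    SketchVfd vfd = { .fd = 42, .flush_in_progress = false };

    StartFlushOnFile(&vfd);
    printf("may close during flush? %s\n", CanCloseFile(&vfd) ? "yes" : "no");
    EndFlushOnFile(&vfd);
    printf("may close after flush?  %s\n", CanCloseFile(&vfd) ? "yes" : "no");
    return 0;
}

In the real code this marker (or mutex) would of course live in the
existing vfd bookkeeping, and mdclose / the LRU eviction would either
skip the file or wait until the flush has released it.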

>> I'm conscious that the patch only addresses *checkpointer* writes, not those
>> from the bgwriter or backends. I agree that these will need to be addressed
>> at some point as well, but given the time it takes to get a patch through,
>> and that the more complex it is the slower it goes (sorting proposals are
>> 10 years old), I think this should be postponed for later.
>
> I think we need to have at least a PoC of all of the relevant
> changes. We're doing these to fix significant latency and throughput
> issues, and if the approach turns out not to be suitable for
> e.g. bgwriter or backends, that might have influence over checkpointer's
> design as well.

Hmmm. See below.

>>> What I did not expect, and what confounded me for a long while, is that
>>> for workloads where the hot data set does *NOT* fit into shared buffers,
>>> sorting often led to a noticeable reduction in throughput. Up to
>>> 30%.
>>
>> I did not see such behavior in the many tests I ran. Could you share more
>> precise details so that I can try to reproduce this performance regression?
>> (available memory, shared buffers, db size, ...).
>
>
> I generally found that I needed to disable autovacuum's analyze to get
> anything even close to stable numbers. The issue described in
> http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
> otherwise badly kicks in. I basically just set
> autovacuum_analyze_threshold to INT_MAX/2147483647 to prevent that from occurring.
>
> I'll show actual numbers at some point yes. I tried three different systems:
>
> * my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
> shared_buffers. Tried checkpoint timeouts from 60 to 300s.

Hmmm. This is quite short. I tend to do tests with much larger timeouts,
and I would advise against a short timeout, especially in a high-throughput
system: the whole point of the checkpointer is to accumulate as many
changes as possible.

I'll look into that.

>> This explanation seems to suggest that if bgwriter/worker writes are sorted
>> and/or coordinated with the checkpointer somehow, then all would be well?
>
> Well, you can't easily sort bgwriter/backend writes stemming from cache
> replacement. Unless your access patterns are entirely sequential the
> data in shared buffers will be laid out in a nearly entirely random
> order. We could try sorting the data, but with any reasonable window,
> for many workloads the likelihood of actually achieving much with that
> seems low.

Maybe the sorting could be shared with others so that everybody uses the
same order?

That would suggest having one global sort order of the buffers, maybe
maintained by the checkpointer, which could be used by all processes that
need to scan the buffers in file order, instead of scanning them in
memory order.

For this purpose, I think that the initial index-based sorting would
suffice. It could be re-sorted periodically, with the delay controlled by
a GUC, or whenever significant buffer changes (reads & writes) have occurred.
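
As a rough sketch of what that shared file order could look like
(simplified, with invented names; the real buffer tags and the patch's
own sorting code are of course different): sort buffer ids once by
tablespace/relation/fork/block, and let any process that wants to write
in file order walk the resulting array.

/*
 * Standalone sketch of a shared "file order" for buffers. The tag
 * layout and names are simplified for illustration only.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct SketchBufferTag
{
    uint32_t tablespace;
    uint32_t relnode;
    uint32_t forknum;
    uint32_t blocknum;
} SketchBufferTag;

/* One entry per shared buffer: its id plus a copy of its tag. */
typedef struct SortedBufferEntry
{
    int             buf_id;
    SketchBufferTag tag;
} SortedBufferEntry;

/* Compare in file order: tablespace, then relation, then fork, then block. */
static int
file_order_cmp(const void *a, const void *b)
{
    const SortedBufferEntry *x = a;
    const SortedBufferEntry *y = b;

    if (x->tag.tablespace != y->tag.tablespace)
        return x->tag.tablespace < y->tag.tablespace ? -1 : 1;
    if (x->tag.relnode != y->tag.relnode)
        return x->tag.relnode < y->tag.relnode ? -1 : 1;
    if (x->tag.forknum != y->tag.forknum)
        return x->tag.forknum < y->tag.forknum ? -1 : 1;
    if (x->tag.blocknum != y->tag.blocknum)
        return x->tag.blocknum < y->tag.blocknum ? -1 : 1;
    return 0;
}

int
main(void)
{
    /* A tiny fake buffer pool, deliberately out of file order. */
    SortedBufferEntry pool[] = {
        {0, {1663, 16384, 0, 7}},
        {1, {1663, 16384, 0, 2}},
        {2, {1663, 16400, 0, 0}},
        {3, {1663, 16384, 1, 0}},
    };
    size_t n = sizeof(pool) / sizeof(pool[0]);

    /* Whoever maintains the shared order (say the checkpointer) sorts once... */
    qsort(pool, n, sizeof(pool[0]), file_order_cmp);

    /* ...and bgwriter/backends could then scan buffers in this order. */
    for (size_t i = 0; i < n; i++)
        printf("buf %d -> rel %u fork %u block %u\n",
               pool[i].buf_id, (unsigned) pool[i].tag.relnode,
               (unsigned) pool[i].tag.forknum, (unsigned) pool[i].tag.blocknum);
    return 0;
}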

>> ISTM that this explanation could be checked by looking whether
>> bgwriter/workers writes are especially large compared to checkpointer writes
>> in those cases with reduced throughput? The data is in the log.
>
> What do you mean with "large"? Numerous?

I mean that the number of buffers written by the bgwriter/workers is
greater than the number written by the checkpointer. If everything fits in
shared buffers, the bgwriter/workers mostly do not need to write anything
and the checkpointer does all the writes.

The larger the memory needed, the more likely the workers/bgwriter will
have to kick in and generate random I/Os, because nothing sensible is
currently done there, so this is consistent with your findings, although,
as already said, I'm surprised that it would have such a large effect on
throughput.

>> Hmmm. The shorter the timeout, the more likely the sorting is NOT to be
>> effective
>
> You mean, as evidenced by the results, or is that what you'd actually
> expect?

What I would expect...

--
Fabien.
