Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-08-24 07:15:44
Message-ID: alpine.DEB.2.10.1508240810170.14924@sto
Lists: pgsql-hackers


Hello Amit,

>> Can the script be started on its own at all?
>
> I have tried like below which results in same error, also I tried few
> other variations but could not succeed.
> ./avg.py

Hmmm... Ensure that the script is readable and executable:

sh> chmod a+rx ./avg.py

Also check the file:

sh> file ./avg.py
./avg.py: Python script, UTF-8 Unicode text executable
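
If both are fine, also check the first line of the script: to be executed
directly, it must start with a shebang pointing to the python interpreter,
for instance (assuming python is in the PATH):

sh> head -1 ./avg.py
#!/usr/bin/env python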

>> Sure... This is *already* the case with the current checkpointer, the
>> schedule is performed with respect to the initial number of buffers it
>> think it will have to write, and if someone else writes these buffers then
>> the schedule is skewed a little bit, or more... I have not changed this
>
> I don't know how good or bad it is to build further on somewhat skewed
> logic,

The logic is no more skewed than it is with the current version: your
remark that the estimation may be wrong in some cases is clearly valid,
but it is orthogonal (independent, unrelated) to what is addressed by this
patch.

I currently have no reason to believe that the issue you raise is a major
performance issue, but if it turns out to be one, it may be addressed by
another patch by whoever wants to do so.

What I have done is to demonstrate that generating a lot of random I/Os is
a major performance issue (well, sure), and this patch addresses this
point and provides a major speedup (*3-5) and latency reductions (from +60%
unavailability to nearly full availability) under a high OLTP write load,
by reordering and flushing checkpoint buffers in a sensible way.
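
To make the idea concrete, here is a minimal sketch of the write phase in
python (illustrative only: the actual patch is C code in the checkpointer,
the buffer fields and the 8 kB block size are simplifications, and
fdatasync stands in for the finer-grained flush primitive used on Linux):

import os

def checkpoint_write(buffers, fds, flush_every=64):
    # sort dirty buffers so that each file is written sequentially,
    # instead of issuing writes in random buffer-pool order
    buffers.sort(key=lambda b: (b.tablespace, b.relfile, b.block))
    for i, b in enumerate(buffers, 1):
        fd = fds[b.relfile]
        os.pwrite(fd, b.data, b.block * 8192)
        # flush regularly so that the kernel does not accumulate a
        # huge unsorted write-back backlog
        if i % flush_every == 0:
            os.fdatasync(fd)

The sorting turns the checkpoint into mostly sequential I/O, and the
regular flushing keeps the write-back backlog, hence the I/O latency seen
by concurrent transactions, small.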

> but the point is that unless it is required why to use it.

This is really required to avoid predictable performance regressions; see
below.

>> I do not think that Heikki version worked wrt to balancing writes over
>> tablespaces,
>
> I also think that it doesn't balances over tablespaces, but the question
> is why do we need to balance over tablespaces, can we reliably predict
> in someway which indicates that performing balancing over tablespace can
> help the workload.

The reason for the tablespace balancing is that in the current postgres
buffers are written more or less randomly, so the writes are (probably)
implicitly and statistically balanced over tablespaces because of this
randomness, and indeed, AFAIK, people with multi-tablespace setups have
not complained that postgres was using their disks sequentially.

However, once the buffers are sorted per file, the order becomes
deterministic and there is no implicit balancing anymore, which means that
if someone has a pg setup with several disks, the checkpointer will write
to them sequentially, one after the other, instead of in parallel.
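
For instance, with two tablespaces on two disks sustaining 100 MB/s each,
writing them one after the other yields an aggregate of about 100 MB/s,
whereas interleaving the writes keeps both disks busy and can approach
200 MB/s.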

This regression was pointed out by Andres Freund; I agree that such a
regression on high-end systems must be avoided, hence the tablespace
balancing.

> I think here we are doing more engineering than required for this patch.

I do not think so: Andres' remark is justified, because a performance
regression on high-end systems which use tablespaces would be really
undesirable.

About the balancing code, it is not that difficult, even if it is not
trivial: the point is to select, in a round-robin manner so that all
tablespaces get to write things, a tablespace whose progress ratio
(written/to_write) is below the overall progress ratio, so that it catches
up. I have both written a proof of this logic and tested it (in a separate
script).
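
For reference, the selection rule can be sketched as follows in python
(illustrative structures, not the patch code):

from dataclasses import dataclass

@dataclass
class TsProgress:
    written: int   # buffers already written to this tablespace
    to_write: int  # buffers the checkpoint must write to it

def next_tablespace(spcs, last):
    # return the index of the next tablespace to write to, or None
    total_to_write = sum(t.to_write for t in spcs)
    if total_to_write == 0:
        return None
    overall = sum(t.written for t in spcs) / total_to_write
    n = len(spcs)
    # round robin, starting after the tablespace used last
    for i in range(1, n + 1):
        j = (last + i) % n
        t = spcs[j]
        # pick a tablespace which still has work to do and whose own
        # progress ratio does not exceed the overall progress ratio
        if t.written < t.to_write and t.written / t.to_write <= overall:
            return j
    return None  # all buffers written

Since the overall ratio is the weighted average of the per-tablespace
ratios, at least one unfinished tablespace is at or below it while work
remains, so the loop always finds one: e.g. with to_write counts of 100,
10 and 10, the writes interleave so that all three tablespaces finish at
about the same time.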

--
Fabien.
