Re: Allow a per-tablespace effective_io_concurrency setting

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Allow a per-tablespace effective_io_concurrency setting
Date: 2015-09-02 21:25:36
Message-ID: 55E76950.2070403@2ndquadrant.com
Lists: pgsql-hackers pgsql-performance

Hi,

On 09/02/2015 08:49 PM, Greg Stark wrote:
> On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
>>
>>
>>> + /*----------
>>> + * The user-visible GUC parameter is the number of drives (spindles),
>>> + * which we need to translate to a number-of-pages-to-prefetch target.
>>> + * The target value is stashed in *extra and then assigned to the actual
>>> + * variable by assign_effective_io_concurrency.
>>> + *
>>> + * The expected number of prefetch pages needed to keep N drives busy is:
>>> + *
>>> + * drives | I/O requests
>>> + * -------+----------------
>>> + * 1 | 1
>>> + * 2 | 2/1 + 2/2 = 3
>>> + * 3 | 3/1 + 3/2 + 3/3 = 5 1/2
>>> + * 4 | 4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
>>> + * n | n * H(n)
>>
>> I know you just moved this code. But: I don't buy this formula. Like at
>> all. Doesn't queuing and reordering entirely invalidate the logic here?
>
> I can take the blame for this formula.
>
> It's called the "Coupon Collector Problem": if you get a random
> coupon from a set of n possible coupons, how many random coupons would
> you have to collect before you expect to have at least one of each?
>
> This computation model assumes we have no information about which
> spindle each block will hit. That's basically true for bitmapheapscan
> in most cases, because the idea of bitmapheapscan is to pick a sparse
> set of blocks, and there's no reason the blocks being read will have
> any regularity that causes them all to fall on the same spindles. If
> in fact you're reading a fairly dense set, then bitmapheapscan is
> probably a waste of time and simply reading sequentially would be just
> as fast or even faster.

There are different meanings of "busy". If I get the coupon collector
problem right (after quickly skimming the Wikipedia article today), it
effectively makes sure that each "spindle" has at least 1 request in the
queue. Which sucks in practice, because on spinning rust it makes
queuing (TCQ/NCQ) totally ineffective, and on SSDs it only saturates one
of the multiple channels.
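
FWIW the n * H(n) formula is just a partial harmonic sum, so the
translation from a drive count to a prefetch target boils down to a
loop like this (a minimal sketch with my own names, not the exact
patch code):

#include <math.h>

/*
 * Sketch: translate a drive count into a prefetch-pages target,
 * i.e. compute n * H(n) = n/1 + n/2 + ... + n/n.
 */
static int
drives_to_prefetch_target(int drives)
{
    double      target = 0.0;
    int         i;

    for (i = 1; i <= drives; i++)
        target += (double) drives / (double) i;

    return (int) rint(target);  /* 1 -> 1, 4 -> 8, 10 -> 29 */
}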

On spinning drives, it's usually good to keep iodepth >= 4. For
example, this 10k Seagate drive [1] can do ~450 random IOPS with
iodepth=16, while at iodepth=1 a 10k drive should only be able to do
~150 IOPS. The other SAS drives behave quite similarly.

[1]
http://www.storagereview.com/seagate_enterprise_performance_10k_hdd_savvio_10k6_review

On SSDs the good values usually start at 16, depending on the model,
the controller, and the size (large SSDs are basically multiple small
ones glued together, and thus have more channels).
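
One way to find a good iodepth for a particular system is to simply
measure it, e.g. with fio at increasing queue depths (a sketch, adjust
the filename/size to your setup):

fio --name=randread --filename=/tmp/testfile --size=4G \
    --ioengine=libaio --direct=1 --rw=randread --bs=8k \
    --iodepth=16 --runtime=30 --time_based --group_reporting

and then compare the reported IOPS for iodepth=1, 4, 16, 32, ...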

This is why the numbers from the coupon collector formula are way too
low in many cases. (OTOH this is done per backend, so if there are
multiple backends doing prefetching ...)

>
> We talked about this quite a bit back then and there was no dispute
> that the aim is to provide GUCs that mean something meaningful to the
> DBA who can actually measure them. They know how many spindles they
> have. They do not know what the optimal prefetch depth is and the only
> way to determine it would be to experiment with Postgres. Worse, I

As I explained, spindles have very little to do with it - you need
multiple I/O requests per device to get the benefit. Sure, DBAs should
know how many spindles they have and should be able to determine the
optimal IO depth. But we actually say this in the docs:

A good starting point for this setting is the number of separate
drives comprising a RAID 0 stripe or RAID 1 mirror being used for
the database. (For RAID 5 the parity drive should not be counted.)
However, if the database is often busy with multiple queries
issued in concurrent sessions, lower values may be sufficient to
keep the disk array busy. A value higher than needed to keep the
disks busy will only result in extra CPU overhead.

So we recommend the number of drives as a good starting value, and then
warn against increasing the value further.

Moreover, ISTM it's very unclear what value to use even if you know the
number of devices and optimal iodepth. Setting (devices * iodepth)
doesn't really make much sense, because that effectively computes

(devices*iodepth) * H(devices*iodepth)

which says "there are (devices*iodepth) devices, make sure there's at
least one request for each of them", right? I guess we actually want

(devices*iodepth) * H(devices)

Sadly that means we'd have to introduce another GUC, because we'd need
to track both ndevices and iodepth.

There probably is a value X so that

X * H(X) ~= (devices*iodepth) * H(devices)

but it's far from clear that's what we need (it surely is not in the docs).
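
To put some numbers on that (mine, just to illustrate): with 4 devices
and an optimal iodepth of 16, the target we'd actually want is

    (4*16) * H(4) = 64 * ~2.08 ~= 133 prefetched pages

while setting effective_io_concurrency=64 yields

    64 * H(64) ~= 64 * 4.74 ~= 304 prefetched pages

i.e. more than twice what's needed. The X above would be somewhere
around 33, because 33 * H(33) ~= 135.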

> think the above formula works for essentially random I/O but for
> more predictable I/O it might be possible to use a different formula.
> But if we made the GUC something low level like "how many blocks to
> prefetch" then we're left in the dark about how to handle that
> different access pattern.

Maybe. We only use this for bitmap heap scans at this point, and I don't
see any proposals to introduce it in other places. So no opinion.

>
> I did speak to a dm developer and he suggested that the kernel could
> help out with an API. He suggested something of the form "how many
> blocks do I have to read before the end of the current device". I
> wasn't sure exactly what we would do with something like that but it
> would be better than just guessing how many I/O operations we need
> to issue to keep all the spindles busy.

I don't really see how that would help us?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
