Re: Cost limited statements RFC

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cost limited statements RFC
Date: 2013-06-07 15:35:18
Message-ID: 51B1FDB6.7080901@2ndQuadrant.com
Lists: pgsql-hackers

On 6/7/13 10:14 AM, Robert Haas wrote:
>> If the page hit limit goes away, the user with a single-core server who
>> is used to having autovacuum only pillage shared_buffers at 78MB/s might
>> complain if it became unbounded.
>
> Except that it shouldn't become unbounded, because of the ring-buffer
> stuff. Vacuum can pillage the OS cache, but the degree to which a
> scan of a single relation can pillage shared_buffers should be sharply
> limited.

I wasn't talking about disruption of the data that's in the buffer
cache. The only time the scenario I was describing plays out is when
the data is already in shared_buffers. The concern is damage done to
the CPU's data cache by this activity. Right now you can't even reach
100MB/s of damage to your CPU caches in an autovacuum process. Ripping
out the page hit cost will eliminate that cap. Autovacuum could
introduce gigabytes per second of memory -> L1 cache transfers. That's
what all my details about memory bandwidth were trying to put into
context. I don't think it really matters much, because the new
bottleneck will be the processing speed of a single core, and that's
still a decent cap for most people right now.
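
For reference, the 78MB/s figure quoted above falls straight out of the
defaults.  A back-of-the-envelope version of the math, assuming the
stock vacuum_cost_limit = 200, 20ms autovacuum cost delay,
vacuum_cost_page_hit = 1, and 8K pages:

SELECT (200 / 1)              -- page hits allowed per cost cycle
     * (1000.0 / 20)          -- cost cycles per second (20ms delay)
     * 8192 / (1024 * 1024)   -- bytes per page, converted to MB
     AS hit_mb_per_sec;       -- => 78.125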

> I think you're missing my point here, which is that we shouldn't
> have any such things as a "cost limit". We should limit reads and
> writes *completely separately*. IMHO, there should be a limit on
> reading, and a limit on dirtying data, and those two limits should not
> be tied to any common underlying "cost limit". If they are, they will
> not actually enforce precisely the set limit, but some other composite
> limit which will just be weird.

I see the distinction you're making now; I don't need a mock-up to
follow you. The main challenge of moving that way is that read and
write rates never end up being completely disconnected from one
another. A read will only cost some fraction of what a write does,
but the two shouldn't be treated as completely independent.

Just because I'm comfortable doing 10MB/s of reads and 5MB/s of writes,
I may not be happy with the server doing 9MB/s of reads plus 5MB/s of
writes, 14MB/s of total I/O, in an implementation where the two float
independently. It's certainly possible to disconnect them like that,
and people will be able to work something out anyway. I personally
would prefer not to lose the ability to specify how expensive read and
write operations should be considered relative to one another.
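
To make the coupling concrete, here's a rough sketch, in psql terms, of
what a blended budget does.  Assuming the page_miss = 2 / page_dirty =
20 costs I tune below and the ~10000 cost points per second the default
limit and delay work out to, every MB/s of reads consumes budget that
would otherwise have allowed dirtying:

SELECT read_mb_per_sec,                            -- 128 8K pages per MB
       round((10000 - read_mb_per_sec * 128 * 2)   -- points left after reads
             / 20.0 / 128, 2)                      -- -> dirty pages/s -> MB/s
       AS write_mb_per_sec
FROM generate_series(0, 5) AS read_mb_per_sec;

That trade-off is exactly the thing that disappears if the two limits
float independently.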

Related aside: shared_buffers is becoming a decreasing fraction of
total RAM each release, because it's stuck with this rough 8GB limit
right now. As the OS cache becomes a larger multiple of the
shared_buffers size, the expense of the average read is dropping. Reads
are getting more likely to be in the OS cache but not shared_buffers,
which makes the average cost of any one read shrink. But writes are as
expensive as ever.

Real-world tunings I'm doing now reflect that. On servers with >128GB
of RAM, I've typically gone this far in that direction:

vacuum_cost_page_hit = 0
vacuum_cost_page_miss = 2
vacuum_cost_page_dirty = 20

That's 4MB/s of writes, 40MB/s of reads, or some blended mix that
considers writes 10X as expensive as reads. The blend is a feature.
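
The arithmetic behind those numbers, for anyone who wants to check it,
under the same assumptions as before (200 cost points per 20ms, so
~10000 points/sec, and 8K pages):

SELECT round(10000 / 20 * 8192 / (1024 * 1024.0), 1) AS dirty_mb_per_sec,
       round(10000 / 2  * 8192 / (1024 * 1024.0), 1) AS miss_mb_per_sec;
-- => 3.9 and 39.1, i.e. the 4MB/s and 40MB/s round numbers above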

The logic here is starting to remind me of how the random_page_cost
default has been justified. Real-world random reads are actually close
to 50X as expensive as sequential ones. But the average read from the
executor's perspective is effectively discounted by OS cache hits, so
4.0 is still working OK. In large memory servers, random reads keep
getting cheaper via better OS cache hit odds, and it's increasingly
becoming something important to tune for.

Some of this mess would go away if we could crack the shared_buffers
scaling issues for 9.4. There's finally enough dedicated hardware
around to see the issue and work on it, but I haven't gotten a clear
picture of any reproducible test workload that gets slower with large
buffer cache sizes. If anyone has a public test case that gets slower
when shared_buffers goes from 8GB to 16GB, please let me know; I've got
two systems set up that I could chase that down on now.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
