Re: Sequential scans

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)enterprisedb(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sequential scans
Date: 2007-05-03 08:54:26
Message-ID: 4639A342.10305@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> We need to think of the interaction with partitioning here. People will
> ask whether we would recommend that individual partitions of a large
> table should be larger/smaller than a particular size, to allow these
> optimizations to kick in.
>
> My thinking is that database designers would attempt to set partition
> size larger than the sync scan limit, whatever it is. That means:
> - they wouldn't want the limit to vary when cache increases, so we *do*
> need a GUC to control the limit. My suggestion now would be
> large_scan_threshold, since it affects both caching and synch scans.

They wouldn't? If you add more memory to your box, so that a table that
didn't fit in memory before now fits, surely you want to switch your
strategy from "don't pollute the cache because it won't fit anyway" to
"let's keep it in cache, so the next scan won't do I/O".

The basic problem with partitions is that if you have a query like
"SELECT * FROM partitioned_table", so that you seq scan multiple
partitions, the size of each partition alone could be below the
threshold, whatever that is, but since you're scanning them all, the net
result is the same as scanning one large table above the threshold. The
same could happen in any plan that does multiple seq scans. It could
even happen at the application level, if the application submits
multiple statements like "SELECT * FROM table1", "SELECT * FROM table2"
one after another.
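
To put rough numbers on it (purely illustrative): with a 1 GB threshold
and ten 500 MB partitions, each individual scan stays under the limit,
yet the statement as a whole pushes 5 GB through the buffer cache, which
is exactly the case the threshold is meant to catch.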

One way to address that would be to manage the recycle buffer ring size
dynamically: when a backend gets a cache miss, the ring would shrink,
and when it gets cache hits, it would grow. That way, if you repeatedly
seq scan a table that fits in cache, that table would get more buffers
from the cache on each iteration until it's completely in cache.
But if the table or tables being scanned are too big to fit in cache,
the ring would stay small and not pollute the cache much.
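
A minimal sketch of that feedback loop, just to illustrate the idea; all
the names and constants here are made up, this is not actual bufmgr code:

#include <stdbool.h>

#define RING_MIN    32      /* never shrink below this many buffers */
#define RING_MAX    8192    /* never grow beyond this many buffers */

typedef struct ScanRing
{
    int     size;           /* current number of buffers in the ring */
} ScanRing;

/*
 * Adjust the ring after each page request: grow slowly on hits, back off
 * quickly on misses. A table that keeps being found in cache claims a few
 * more buffers on every pass; a table too big to fit stays in a small ring.
 */
static void
ring_adjust(ScanRing *ring, bool cache_hit)
{
    if (cache_hit)
    {
        if (ring->size < RING_MAX)
            ring->size++;
    }
    else
    {
        ring->size /= 2;
        if (ring->size < RING_MIN)
            ring->size = RING_MIN;
    }
}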

I'm not going to implement that for now: I'm seeing some scary negative
feedback behavior with it, and it'd need a lot of testing anyway. I'm
thinking of just using shared_buffers as the limit. One could argue for
effective_cache_size as well.
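
As a sketch of the kind of check I mean (hypothetical names, both sizes
in 8 kB blocks, not actual code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a seq scan should go through the small recycle ring
 * (and be a candidate for synchronized scanning) instead of being cached
 * normally. cache_pages would be derived from shared_buffers, or from
 * effective_cache_size if we go that route.
 */
static bool
use_bulk_read_strategy(uint64_t table_pages, uint64_t cache_pages)
{
    return table_pages > cache_pages;
}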

> - so there will be lots of partitions, so a hardcoded limit of 1000
> would not be sufficient. A new GUC, or a link to an existing one, is
> probably required.

No matter how many partitions you have, each backend could still be
scanning only one of them at a time.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
