Prefetch index pages for B-Tree index scans

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Prefetch index pages for B-Tree index scans
Date: 2012-10-18 20:30:26
Message-ID: CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2TXmY+SkJD6G_vjC5QNg@mail.gmail.com
Lists: pgsql-hackers

While running some reporting queries, I've noticed that index scans fail
to saturate server resources on compute-intensive queries.

The problem is that, just after fetching a page, postgres starts computing
before fetching the next one. This results in an I/O - compute - I/O -
compute alternation that leaves both the CPU and the disk partly idle.

I've also noticed that earlier patches attempted to implement prefetching
of index pages, yet they don't seem to have been committed, and I'm
wondering why.

The patches themselves were quite complex, attempting to prefetch heap
tuples in addition to index tuples. I can see how this would be
beneficial, but heap prefetch is notoriously more complicated, and
with the advent of index-only scans, maybe index-page-only prefetching
would be of use.
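
To make the intended overlap concrete, here is a minimal standalone sketch
of the general idea - nothing to do with the attached patch or with
PostgreSQL's own code; the file name, block size and processing function
are invented for illustration. It hints the kernel about the next block
(posix_fadvise with POSIX_FADV_WILLNEED) before computing on the current
one, so the read can proceed while the CPU is busy:

#define _XOPEN_SOURCE 600       /* for posix_fadvise and pread */
#include <fcntl.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* invented; happens to match PostgreSQL's default BLCKSZ */

/* stand-in for whatever per-page computation the query does */
static void process_block(const char *buf) { (void) buf; }

int
main(void)
{
    char    buf[BLOCK_SIZE];
    off_t   off = 0;
    int     fd = open("some_data_file", O_RDONLY);  /* invented file name */

    if (fd < 0)
        return 1;

    while (pread(fd, buf, sizeof(buf), off) == (ssize_t) sizeof(buf))
    {
        /* hint the kernel about the next block before we start computing */
        posix_fadvise(fd, off + BLOCK_SIZE, BLOCK_SIZE, POSIX_FADV_WILLNEED);

        process_block(buf);     /* the read-ahead proceeds while this runs */
        off += BLOCK_SIZE;
    }
    close(fd);
    return 0;
}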

To test this hypothesis, I wrote the attached patch. pgbench doesn't
show any performance impact (expected, since pgbench uses really tiny
queries that touch a single page). Using pgbench's data at scale 1000,
I tried some other queries that are more taxing on index-only scans.
This was done on my personal computer, with a really lame I/O subsystem
(compared to database servers), and with 9.2.1 rather than git, but I
think the results should be indicative anyway. I will try to verify on
RAID-ed disks, but I'm short on test systems at the moment.

Pgbench's biggest index (on pgbench_accounts.aid) is, unsurprisingly,
quite sequential, since it has just been created. So I tested both
forward and backward index scans, to get an idea of how the patch
affects sequential and non-sequential workloads. I haven't been able to
test truly random ones yet - I'm trying to load up a dump from a
production database that's both big and messy in order to test that.

A few things worth noting: the patch avoids prefetching on single-page
index accesses, and it only issues prefetches when block numbers aren't
sequentially increasing (i.e. prefetch is attempted only when the index
is being scanned nonsequentially), since I noticed the fadvise call was
counterproductive in those cases.
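
For clarity, here is a rough sketch of the shape of that check. This is
not the attached patch: _bt_maybe_prefetch is a made-up helper name, and
gating on effective_io_concurrency (the switch used in the tests below)
is omitted. PrefetchBuffer() is the existing bufmgr entry point, which
boils down to posix_fadvise(POSIX_FADV_WILLNEED) on platforms that
support it; the snippet assumes it is compiled inside the backend.

#include "postgres.h"
#include "storage/block.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper, imagined to be called while stepping through the
 * pages of an index scan; only the heuristic described above is shown.
 */
static void
_bt_maybe_prefetch(Relation rel, BlockNumber cur_blkno, BlockNumber next_blkno)
{
    /* single-page accesses: nothing worth prefetching */
    if (next_blkno == InvalidBlockNumber || next_blkno == cur_blkno)
        return;

    /*
     * Only prefetch when block numbers aren't sequentially increasing:
     * for a sequential walk the kernel's own read-ahead already helps,
     * and the extra fadvise call was observed to be counterproductive.
     */
    if (next_blkno != cur_blkno + 1)
        PrefetchBuffer(rel, MAIN_FORKNUM, next_blkno);
}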

So, here we go:

The base query I used is:

explain analyze select aid, count(*) from pgbench_accounts where aid
between 10000000 and 200000000 group by aid order by aid;

For backward scans, just order by aid desc.

The full command used is:

sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches' ; pg_ctl start -l ${PGLOG} ; sleep 5 ; ( sleep 10 ; echo 'set effective_io_concurrency to 0;' ; echo 'explain analyze select aid, count(*) from pgbench_accounts where aid between 10000000 and 200000000 group by aid order by aid;' ) | psql -h localhost -p 5433 pgbench ; sleep 5 ; pg_ctl stop

The server is started and stopped every time to make sure the shared
cache is empty, and the sleeps are there so that the backlash of I/O from
dropping the caches doesn't influence the benchmark.

Results:

With effective_io_concurrency set to 0 (patch disabled):

Forward:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual time=47.552..31113.353 rows=90000001 loops=1)
   ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4) (actual time=47.542..13197.982 rows=90000001 loops=1)
         Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
         Heap Fetches: 0
 Total runtime: 33648.500 ms
I/O throughput averages 60MB/s (clearly sequential)
I/O utilization averages around 30%

Backward:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual time=10.279..159704.590 rows=90000001 loops=1)
   ->  Index Only Scan Backward using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4) (actual time=10.266..132853.382 rows=90000001 loops=1)
         Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
         Heap Fetches: 0
 Total runtime: 163202.869 ms
I/O throughput averages 11MB/s (clearly not fully random, but not sequential either)
I/O utilization averages 68%

With effective_io_concurrency set to 1 (patch enabled):

Forward:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual time=47.582..30474.222 rows=90000001 loops=1)
   ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4) (actual time=47.571..12208.340 rows=90000001 loops=1)
         Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
         Heap Fetches: 0
 Total runtime: 32875.695 ms
I/O throughput still averages 60MB/s (the tiny performance improvement is
probably just variation, although across the multiple tests I've made
there is a systematic bias towards lower times, probably the effect of
prefetch on the few cases where the index jumps nonsequentially)
I/O utilization remains around 35%

Backward:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual time=28.190..157708.405 rows=90000001 loops=1)
   ->  Index Only Scan Backward using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4) (actual time=28.178..135282.317 rows=90000001 loops=1)
         Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
         Heap Fetches: 0
 Total runtime: 160735.539 ms
I/O throughput averages 12MB/s (a small increase), and the 3-second
difference seems related to it (it's consistent).
I/O utilization averages 88% (an important increase)

This last result makes me think deeper prefetching could potentially be
beneficial (it would result in read merges), but that's rather hard to
implement without a skiplist of leaf pages. Maybe the backward-sequential
pattern could be detected; I'll have to tinker with that.
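
Purely as a sketch of what I mean - again not part of the attached patch,
with the depth constant, the static state and the helper name all invented
for illustration, and reusing the same backend headers as the sketch
above - detecting the backward-sequential pattern and hinting a run of
preceding blocks could look roughly like this:

#define BACKWARD_PREFETCH_DEPTH 8       /* invented tuning knob */

/*
 * Speculative: remember the previous block visited; when the scan turns
 * out to be walking blocks backwards one by one, hint a whole run of the
 * preceding blocks so the kernel gets a chance to merge the reads.
 */
static BlockNumber prev_blkno = InvalidBlockNumber;

static void
_bt_prefetch_backward_run(Relation rel, BlockNumber blkno)
{
    if (prev_blkno != InvalidBlockNumber && blkno == prev_blkno - 1)
    {
        int     i;

        /* blkno >= i guards against wrapping below block 0 */
        for (i = 1; i <= BACKWARD_PREFETCH_DEPTH && blkno >= (BlockNumber) i; i++)
            PrefetchBuffer(rel, MAIN_FORKNUM, blkno - i);
    }
    prev_blkno = blkno;
}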

With a hot OS cache, there's no discernible difference in timings.

For heap-including forward scans (using sum(abalance) instead of
count(*)), the times are 418522.583 ms with the patch and 425497.436 ms
without (still a very small, but still consistent, improvement).
For heap-including backward scans, the times are 1764391.372 ms with the
patch (with I/O at 8.4MB/s and 95% utilization) and 1794715.164 ms
without (7.8MB/s at 92% utilization). So, not whopping, but better.

However small the improvement (about 5% on backward scans), since I see
no ill effects, I thought I'd post the patch. Besides, I'm convinced
that, first, with more chaotic indexes the improvement should be
noticeably bigger, and second, with more complex queries the
parallelization may be too, especially with queries involving more than
one index scan. I believe this because, if I attach strace to the
backend process, the improvement with prefetch jumps to 30% on backward
scans, probably due to the increased computation window between reads.

Attachment Content-Type Size
postgresql-git-bt_prefetch.diff application/octet-stream 5.1 KB
