Processing btree walks as a batch to parallelize IO

From: James Coleman <jtc331(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Processing btree walks as a batch to parallelize IO
Date: 2021-04-09 17:33:31
Message-ID: CAAaqYe9UFQGXu_wWLESJL+pDas1ekAcN4-sXDej1xfFfiE2owg@mail.gmail.com
Lists: pgsql-hackers

$SUBJECT is still a very loosely formed idea, so forgive the lack of detail
and the things I've likely missed, but I wanted to get it out there to see if
it sounded at all intriguing to people.

Background: One of the big problems with non-local storage such as AWS EBS
volumes or a SAN is that when a large database (really, the working set,
where the working set includes reads) exceeds the size of the buffer cache
(and page cache), the cost of random page reads hitting the underlying disk
system dominates. This is because networked disks have an order of magnitude
higher latency than a bunch of RAIDed SSDs (even more so with NVMe storage).
In some of our experiments on Aurora I've seen a 10x difference versus
pretty good physical hardware, and I'd assume RDS (since it's EBS-backed) is
similar.

A specific area where this is particularly painful is btree index reads.
Walking the tree to the leaf pages isn't naturally prefetchable, so for each
level you pay the random page cost. Of course the higher levels of the tree
will almost certainly be in the buffer cache (just by virtue of LRU caching),
but for a large index the lower levels likely won't be.

If we squint a bit, insertions look a whole lot like reads as well, since we
have to walk the tree to find the leaf insertion page for a new tuple. This
is particularly true for indexes whose inserted keys are roughly randomly
distributed, like a UUID.

The read-for-lookups problem is harder to solve, but the cost as it relates
to table inserts is possibly more tractable. Tables typically have more
than one index to update, so the obvious approach is "let's just
parallelize the index insertions". Of course we know that's difficult given
the multi-process approach Postgres uses for parallelism.

Another approach that at first glance seems like it fits better into
Postgres (I'm not claiming it's easy or a small patch) would be to process
a batch of indexes at once. For example, if the index access methods were
extended to allow being given a list of indexes that need to be walked,
then the btree code could process each level of the walk as a group:
issuing IO fetches for all of the first-level blocks across those indexes,
then computing all of the next-level blocks needed and issuing those IO
requests as a batch, and so on.
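
To make the shape of that concrete, here's a very rough sketch of the
level-at-a-time loop I have in mind. This is not real nbtree code; the types
and helper functions below (BatchedWalk, prefetch_block, read_block,
block_is_leaf, pick_child_block) are purely hypothetical stand-ins for
whatever the actual buffer-manager and nbtree interfaces would end up being:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers -- stand-ins for the real buffer-manager and
 * nbtree routines an actual implementation would use.
 */
extern void     prefetch_block(void *index, uint32_t block);   /* start IO */
extern void    *read_block(void *index, uint32_t block);       /* get page */
extern bool     block_is_leaf(void *page);
extern uint32_t pick_child_block(void *page, const void *key);

typedef struct BatchedWalk
{
    void       *index;      /* index being descended */
    const void *key;        /* insertion/search key for this index */
    uint32_t    next_block; /* block to visit at the current level */
    bool        done;       /* reached the leaf level? */
} BatchedWalk;

/*
 * Descend several btrees in lockstep, one level at a time, issuing the
 * reads for each level as a batch so the random-read latencies overlap
 * instead of being paid serially, one index after another.
 */
static void
batched_btree_search(BatchedWalk *walks, int nwalks)
{
    bool        all_done = false;

    while (!all_done)
    {
        /* Phase 1: kick off IO for every walk's current block. */
        for (int i = 0; i < nwalks; i++)
        {
            if (!walks[i].done)
                prefetch_block(walks[i].index, walks[i].next_block);
        }

        /*
         * Phase 2: read each block (hopefully already in flight or
         * cached by now) and compute the child block for the next level.
         */
        all_done = true;
        for (int i = 0; i < nwalks; i++)
        {
            void       *page;

            if (walks[i].done)
                continue;

            page = read_block(walks[i].index, walks[i].next_block);
            if (block_is_leaf(page))
                walks[i].done = true;
            else
            {
                walks[i].next_block = pick_child_block(page, walks[i].key);
                all_done = false;
            }
        }
    }
}

The property we're after is that each round of the loop issues up to nwalks
reads before blocking on any of them, so descending N indexes pays roughly
the latency of the deepest individual tree rather than the sum of all the
trees' depths.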

In some workloads we've been testing I believe such an approach could
plausibly improve table insert (and update) performance by several hundred
percent.

Beyond the rough sketch above I don't have any code to show here, but I
wanted to get the idea out there to see if there were any immediate
reactions or other thoughts on the topic.

Thoughts?

James
