Re: Processing btree walks as a batch to parallelize IO

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Processing btree walks as a batch to parallelize IO
Date: 2021-05-07 18:11:23
Message-ID: CAAaqYe8uaSeK-k8mESzAHxg8m5-iime_6mNgEi-dv4rMS255=Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 9, 2021 at 4:57 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>
>
> On 4/9/21 7:33 PM, James Coleman wrote:
> > $SUBJECT is still a very loosely formed idea, so forgive lack of detail
> > or things I've likely missed, but I wanted to get it out there to see if
> > it sounded at all intriguing to people.
> >
> > Background: One of the big problems with non-local storage such as AWS
> > EBS volumes or a SAN is that when a large database (really, the working
> > set, where the working set includes reads) exceeds the size of the buffer
> > cache (and page cache), the cost of random page reads hitting the underlying disk
> > system dominates. This is because networked disks have an order of
> > magnitude higher latency than a bunch of RAIDed SSDs (even more so with
> > NVMe storage). In some of our experiments on Aurora I've seen a 10x
> > change versus pretty good physical hardware, and I'd assume RDS (since
> > it's EBS-backed) is similar.
> >
> > A specific area where this is particularly painful is btree index reads.
> > Walking the tree to leaf pages isn't naturally prefetchable, and so for
> > each level you pay the random page cost. Of course higher levels in the
> > tree will almost certainly exhibit emergent behavior such that they
> > (just by virtue of LRU caching) will be in the buffer cache, but for a
> > large index the lower levels likely won't be.
> >
>
> What do you consider a large index level?

In general it's probably all levels but the leaves (though that depends
on cache size, index size, etc.).

> Consider a 1TB table, with just a single UUID column - that's ~25B rows,
> give or take. Real tables will have more columns, so this seems like a
> reasonable model of the largest number of rows per relation. With ~32B
> per index tuple, that's about 100M leaf pages, and with ~256 branches
> per internal page, that's still only ~5 levels. I think it's quite rare
> to see indexes with more than 6 or 7 levels.
>
> And the internal pages are maybe 0.5% of the whole index (so ~4GB out of
> 750GB). I think the usual expectation is that most of that will fit into
> RAM, but of course there may be more indexes competing for that.
>
> I think the index level is not really the crucial bit - it's more about
> the total amount of indexes in the DB.

I suppose? If the total size of the tables/indexes/etc. is sufficiently
large relative to the cache size, the quantity won't matter.
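Tomas's arithmetic above can be checked with a quick back-of-envelope
script (the tuple and fanout figures are the assumptions from his
example, not measured values):

```python
import math

def btree_height(nrows, leaf_tuples_per_page, fanout):
    """Estimate btree depth: the leaf level plus however many internal
    levels it takes to reduce the page count to a single root."""
    leaf_pages = math.ceil(nrows / leaf_tuples_per_page)
    height, pages = 1, leaf_pages  # count the leaf level itself
    while pages > 1:
        pages = math.ceil(pages / fanout)
        height += 1
    return height, leaf_pages

# ~25B rows, ~256 index tuples per 8kB leaf page (~32B per tuple),
# ~256 branches per internal page, as in the example above
height, leaf_pages = btree_height(25_000_000_000, 256, 256)
# leaf_pages comes out near 100M and height at 5, matching the
# "~100M leaf pages ... only ~5 levels" estimate
```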

> > If we squint a bit, insertions look a whole lot like reads as well since
> > we have to walk the tree to find the leaf insertion page for a new
> > tuple. This is particularly true for indexes where inserts are roughly
> > randomly distributed data, like a uuid.
> >
>
> Yep. We need to walk the index to the leaf pages in both cases, both for
> read and insert workloads.
>
> > The read-for-lookups problem is harder to solve, but the cost as it
> > relates to table inserts is possibly more tractable. Tables typically
> > have more than one index to update, so the obvious approach is "let's
> > just parallelize the index insertions". Of course we know that's
> > difficult given the multi-process approach Postgres uses for parallelism.
> >
>
> Hmm. Not sure if reads are harder to deal with, but I think you're right
> those two cases (reads and writes) may look similar at the level of a
> single index, but may need rather different approaches exactly because
> inserts have to deal with all indexes, while reads only really deal with
> a single index.

Right. In practice it's harder to deal with a single index scan
because you don't have multiple such scans to parallelize.

> FWIW I think there are a couple options for improving reads, at least in
> some cases.
>
> 1) I wonder if e.g. _bt_readnextpage could prefetch at least one page
> ahead. We can't look further ahead, but perhaps this would help.
>
> 2) In some cases (e.g. nested loop with inner indexes scan) we could
> collect an array of values and then look them up at once, which should
> allow us to do at least some of the I/O in parallel, I think. That's
> similar to what you propose for writes, except that it works against the
> same index.

The "collect an array of values" approach isn't one I'd considered,
but seems likely interesting.
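As a rough illustration of the "collect an array of values" idea, here's
a toy sketch (not Postgres code; the index structure and the prefetch
hook are invented for illustration): phase one computes which leaf page
each key needs, phase two issues all the prefetches as one batch, and
only then do the individual probes run.

```python
import bisect

class ToyIndex:
    """A single-level 'btree': one root of separator keys over a
    list of fixed-size leaf pages."""
    def __init__(self, keys, keys_per_page=4):
        self.pages = [keys[i:i + keys_per_page]
                      for i in range(0, len(keys), keys_per_page)]
        self.separators = [p[0] for p in self.pages]
        self.prefetched = set()

    def leaf_for(self, key):
        # which leaf page would this key land on?
        return bisect.bisect_right(self.separators, key) - 1

    def prefetch(self, page_ids):
        # stand-in for issuing posix_fadvise / async reads in parallel
        self.prefetched.update(page_ids)

    def lookup(self, key):
        page = self.leaf_for(key)
        assert page in self.prefetched  # would otherwise be a cold read
        return key in self.pages[page]

idx = ToyIndex(list(range(0, 100, 2)))     # even keys 0..98
batch = [10, 11, 40, 96]
idx.prefetch({idx.leaf_for(k) for k in batch})  # one batched I/O phase
results = [idx.lookup(k) for k in batch]        # [True, False, True, True]
```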

> > Another approach that at first glance seems like it fits better into
> > Postgres (I'm not claiming it's easy or a small patch) would be to
> > process a batch of indexes at once. For example, if the index access
> > methods were extended to allow being given a list of indexes that need
> > to be walked, then the btree code could process each layer in the walk
> > as a group -- issuing IO fetches for all of the first level blocks in
> > the tree, and then computing all of the next-level blocks needed and
> > issuing those IO requests at once, and so on.
> >
>
> Yeah, I agree having a way to say "prefetch all pages needed to insert
> these keys into these indexes" might be better than just parallelizing
> it in a "naive" way.
>
> Not sure how complex it would be - I think the API would need to allow
> traversing the index with each step split into two phases:
>
> 1) determine the page needed for the next step, return it to caller
>
> 2) the caller collects pages from all indexes, initiates prefetch
>
> 3) instruct indexes to actually do the next step, stop if it's a leaf
> page (otherwise go to (1))
>
> And then we might just do index inserts in a serial way, just like we do
> today, hoping to hit the prefetched pages.

Correct; this is roughly what I was envisioning.
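The three-phase loop Tomas describes might be sketched like this (a toy
model, not the nbtree API; all names here are hypothetical): each
descent reports the page it needs next, the caller prefetches the whole
batch, then every descent takes one step down.

```python
import bisect

class Descent:
    """One walk down a toy btree stored as a list of levels; each node is
    a sorted list of (separator_key, child_id) pairs, and at the last
    level child_id is a leaf page id."""
    def __init__(self, levels, key):
        self.levels, self.key = levels, key
        self.level, self.node = 0, 0  # start at the root

    def next_page(self):
        return (self.level, self.node)        # phase 1: page we need

    def step(self):
        entries = self.levels[self.level][self.node]
        seps = [s for s, _ in entries]
        slot = max(bisect.bisect_right(seps, self.key) - 1, 0)
        self.node = entries[slot][1]          # phase 3: descend one level
        self.level += 1
        return self.level == len(self.levels)  # reached the leaf level?

def batch_descend(descents, prefetch_pages):
    active = list(descents)
    while active:
        # phase 2: one batched prefetch per tree level, across all walks
        prefetch_pages([d.next_page() for d in active])
        active = [d for d in active if not d.step()]

levels = [
    [[(0, 0), (20, 1)]],                      # root level
    [[(0, 0), (10, 1)], [(20, 2), (30, 3)]],  # internal level -> leaf ids
]
batches = []
walks = [Descent(levels, 5), Descent(levels, 25)]
batch_descend(walks, batches.append)
leaf_ids = [d.node for d in walks]  # leaf page reached by each key
```

The point of the sketch is the shape of the loop: I/O is issued once per
level for the whole batch, rather than once per page per walk.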

> FWIW while this probably helps saturating the I/O, it unfortunately does
> nothing to reduce the write amplification - we still need to modify the
> same amount of leaf pages in all indexes, produce the same amount of WAL
> etc. I think there were some proposals to add small internal buffers,
> and instead of pushing the inserts all the way down to the leaf page,
> just add them to the internal buffer. And when the buffer gets full,
> propagate the contents to the next level of buffers.
>
> For example, each internal page might have one "buffer" page, so the
> index size would not really change (the internal pages would double, but
> it's still just ~1% of the total index size). Of course, this makes
> lookups more complex/expensive, because we need to check the internal
> buffers. But it does reduce the write amplification, because it combines
> changes to leaf pages.

I think I've seen that discussion, and it's very interesting, but I
think it's still orthogonal to this.
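For context, the buffered-insert idea being referenced might look
roughly like this toy sketch (a hypothetical structure, not any actual
proposal's code): inserts accumulate in a small per-node buffer and are
pushed down only when the buffer fills, so several inserts share one
leaf write, and lookups correspondingly have to check each buffer on the
way down.

```python
class BufferedNode:
    """Toy two-level tree: a root with a buffer over two leaf children,
    split at key 50. Capacity is tiny so flushes are easy to observe."""
    def __init__(self, children=None, capacity=3):
        self.buffer, self.capacity = [], capacity
        self.children = children   # None for a leaf node
        self.leaf_keys = []        # used only at the leaf level

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if self.children is None:
            self.leaf_keys.extend(self.buffer)  # one combined leaf write
        else:
            for key in self.buffer:             # push down by key range
                self.children[0 if key < 50 else 1].insert(key)
        self.buffer.clear()

    def contains(self, key):
        # the lookup cost mentioned above: buffers must be checked too
        if key in self.buffer:
            return True
        if self.children is None:
            return key in self.leaf_keys
        return self.children[0 if key < 50 else 1].contains(key)

root = BufferedNode(children=[BufferedNode(), BufferedNode()])
for k in [10, 60, 20, 70, 30, 80]:
    root.insert(k)
# each leaf received its three keys in just one extend() call apiece
```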

> > In some workloads we've been testing I believe such an approach could
> > plausibly improve table insert (and update) performance by multiple
> > hundreds of percent.
> >
> > I don't have any code at the moment to show here, but I wanted to get
> > the idea out there to see if there were any immediate reactions or other
> > thoughts on the topic.
> >
> > Thoughts?
> >
>
> I think you're right indexes may be a serious bottleneck in some cases,
> so exploring ways to improve that seems useful. Ultimately I think we
> should be looking for ways to reduce the amount of work we need to do,
> but parallelizing it (i.e. doing the same amount of work but in multiple
> processes) is a valid approach too.

Thanks for the feedback.

James
