Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PATCH: Using BRIN indexes for sorted output
Date: 2022-10-15 13:46:47
Message-ID: CALNJ-vTA-QvA3bQYwTuv6euzt1kXb4kcTHuRnpwE8gfg4YTrFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Oct 15, 2022 at 5:34 AM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> Hi,
>
> There have been a couple discussions about using BRIN indexes for
> sorting - in fact this was mentioned even in the "Improving Indexing
> Performance" unconference session this year (don't remember by whom).
> But I haven't seen any patches, so here's one.
>
> The idea is that we can use information about ranges to split the table
> into smaller parts that can be sorted in smaller chunks. For example if
> you have a tiny 2MB table with two ranges, with values in [0,100] and
> [101,200] intervals, then it's clear we can sort the first range, output
> tuples, and then sort/output the second range.
>
> The attached patch builds "BRIN Sort" paths/plans, closely resembling
> index scans, only for BRIN indexes. And this special type of index scan
> does what was mentioned above - incrementally sorts the data. It's a bit
> more complicated because of overlapping ranges, ASC/DESC, NULL etc.
>
> This is disabled by default, using a GUC enable_brinsort (you may need
> to tweak other GUCs to disable parallel plans etc.).
>
> A trivial example, demonstrating the benefits:
>
> create table t (a int) with (fillfactor = 10);
> insert into t select i from generate_series(1,10000000) s(i);
>
>
> First, a simple LIMIT query:
>
> explain (analyze, costs off) select * from t order by a limit 10;
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Limit (actual time=1879.768..1879.770 rows=10 loops=1)
> -> Sort (actual time=1879.767..1879.768 rows=10 loops=1)
> Sort Key: a
> Sort Method: top-N heapsort Memory: 25kB
> -> Seq Scan on t
> (actual time=0.007..1353.110 rows=10000000 loops=1)
> Planning Time: 0.083 ms
> Execution Time: 1879.786 ms
> (7 rows)
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Limit (actual time=1.217..1.219 rows=10 loops=1)
> -> BRIN Sort using t_a_idx on t
> (actual time=1.216..1.217 rows=10 loops=1)
> Sort Key: a
> Planning Time: 0.084 ms
> Execution Time: 1.234 ms
> (5 rows)
>
> That's a pretty nice improvement - of course, this is thanks to having a
> perfectly sequential, and the difference can be almost arbitrary by
> making the table smaller/larger. Similarly, if the table gets less
> sequential (making ranges to overlap), the BRIN plan will be more
> expensive. Feel free to experiment with other data sets.
>
> However, not only the LIMIT queries can improve - consider a sort of the
> whole table:
>
> test=# explain (analyze, costs off) select * from t order by a;
>
> QUERY PLAN
> -------------------------------------------------------------------------
> Sort (actual time=2806.468..3487.213 rows=10000000 loops=1)
> Sort Key: a
> Sort Method: external merge Disk: 117528kB
> -> Seq Scan on t (actual time=0.018..1498.754 rows=10000000 loops=1)
> Planning Time: 0.110 ms
> Execution Time: 3766.825 ms
> (6 rows)
>
> test=# explain (analyze, costs off) select * from t order by a;
> QUERY PLAN
>
>
> ----------------------------------------------------------------------------------
> BRIN Sort using t_a_idx on t (actual time=1.210..2670.875 rows=10000000
> loops=1)
> Sort Key: a
> Planning Time: 0.073 ms
> Execution Time: 2939.324 ms
> (4 rows)
>
> Right - not a huge difference, but still a nice 25% speedup, mostly due
> to not having to spill data to disk and sorting smaller amounts of data.
>
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.6)
>
> 1) The paths are created in build_index_paths() because that's what
> creates index scans (which the new path resembles). But that is expected
> to produce IndexPath, not BrinSortPath, so it's not quite correct.
> Should be somewhere "higher" I guess.
>
> 2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
> FIRST/LAST does not really matter for them. The patch just generates
> paths for all 4 combinations (or tries to). Maybe there's a better way.
>
> 3) I'm not quite sure the separation of responsibilities between
> opfamily and opclass is optimal. I added a new amproc, but maybe this
> should be split differently. At the moment only minmax indexes have
> this, but adding this to minmax-multi should be trivial.
>
> 4) The state changes in nodeBrinSort is a bit confusing. Works, but may
> need cleanup and refactoring. Ideas welcome.
>
> 5) The costing is essentially just plain cost_index. I have some ideas
> about BRIN costing in general, which I'll post in a separate thread (as
> it's not specific to this patch).
>
> 6) At the moment this only picks one of the index keys, specified in the
> ORDER BY clause. I think we can generalize this to multiple keys, but
> thinking about multi-key ranges was a bit too much for me. The good
> thing is this nicely combines with IncrementalSort.
>
> 7) Only plain index keys for the ORDER BY keys, no expressions. Should
> not be hard to fix, though.
>
> 8) Parallel version is not supported, but I think it shouldn't be
> possible. Just make the leader build the range info, and then let the
> workers to acquire/sort ranges and merge them by Gather Merge.
>
> 9) I was also thinking about leveraging other indexes to quickly
> eliminate ranges that need to be sorted. The node does evaluate filter,
> of course, but only after reading the tuple from the range. But imagine
> we allow BrinSort to utilize BRIN indexes to evaluate the filter - in
> that case we might skip many ranges entirely. Essentially like a bitmap
> index scan does, except that building the bitmap incrementally with BRIN
> is trivial - you can quickly check if a particular range matches or not.
> With other indexes (e.g. btree) you essentially need to evaluate the
> filter completely, and only then you can look at the bitmap. Which seems
> rather against the idea of this patch, which is about low startup cost.
> Of course, the condition might be very selective, but then you probably
> can just fetch the matching tuples and do a Sort.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi,
I am still going over the patch.

Minor: for #8, I guess you meant `it should be possible` .

Cheers

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2022-10-15 15:23:54 Re: PATCH: Using BRIN indexes for sorted output
Previous Message Amit Kapila 2022-10-15 13:14:44 Re: create subscription - improved warning message