Using Btree to Provide Sorting on Suffix Keys with LIMIT

From: James Coleman <jtc331(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Using Btree to Provide Sorting on Suffix Keys with LIMIT
Date: 2018-12-30 00:00:25
Message-ID: CAAaqYe-SsHgXKXPpjn7WCTUnB_RQSxXjpSaJd32aA=Rquv0AgQ@mail.gmail.com
Lists: pgsql-hackers

tl;dr: I'd like to teach btrees to support returning ordered results
where the ordering is only a suffix of the index keys and the query
includes a scalar array op qual on the prefix key of the index.

Suppose we have the following schema:

CREATE TABLE foos(bar_fk integer, created_at timestamp);
CREATE INDEX index_foos_on_bar_fk_and_created_at ON foos(bar_fk, created_at);

and seed it with the following:

INSERT INTO foos (bar_fk, created_at)
SELECT i % 1000, now() - (random() * '5 years'::interval)
FROM generate_series(1, 500000) t(i);

then execute the following query:

SELECT *
FROM foos
WHERE bar_fk IN (1, 2, 3)
ORDER BY created_at
LIMIT 50;

currently we get a query plan (I've disabled bitmap scans for this test) like:

 Limit  (cost=5197.26..5197.38 rows=50 width=12) (actual time=2.212..2.222 rows=50 loops=1)
   ->  Sort  (cost=5197.26..5201.40 rows=1657 width=12) (actual time=2.211..2.217 rows=50 loops=1)
         Sort Key: created_at
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Only Scan using index_foos_on_bar_fk_and_created_at on foos  (cost=0.42..5142.21 rows=1657 width=12) (actual time=0.025..1.736 rows=1500 loops=1)
               Index Cond: (bar_fk = ANY ('{1,2,3}'::integer[]))
               Heap Fetches: 1500
 Planning time: 0.137 ms
 Execution time: 2.255 ms

Note that the index scan (or bitmap scan) nodes return all 1500 rows
matching `bar_fk IN (1,2,3)`. After all rows are returned, that total
set is ordered, and finally the LIMIT is applied. While only 50 rows
were requested, 30x that were fetched from the heap.

I believe it is possible to use the index
`index_foos_on_bar_fk_and_created_at` to fulfill both the `bar_fk IN
(1,2,3)` qualifier and (at least partially; more on that later) the
ordering `ORDER BY created_at` while fetching fewer than all rows
matching the qualifier.

Areas of extension: given an index on `(a, b, c)`, the same approach
covers `a = 1 AND b IN (...) ORDER BY c` and `a IN (...) AND b = 1
ORDER BY c` (and further similar derivations with increasing numbers
of equality quals).

Note: Another (loosely) related problem is that sorting can't
currently take advantage of cases where an index provides a partial
(prefix of requested pathkeys) ordering.

Proposal 1:

General idea: teach btrees to apply LIMIT internally so that we bound
the number of rows fetched and sorted to the <number of array op
elements> * <limit>.
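To make the bound concrete, here's a minimal Python sketch of proposal 1's behavior (the toy index structure, data, and function name are mine for illustration, not from any patch): each per-element scan stops after `limit` matching tuples, so at most `len(array_keys) * limit` rows ever reach the top-N sort.

```python
import heapq

def limited_fetch_then_sort(index, array_keys, limit):
    """Sketch of proposal 1 (hypothetical helper, not PostgreSQL code):
    stop each array element's index scan after `limit` matching tuples,
    so at most len(array_keys) * limit rows reach the sort."""
    fetched = []
    for key in array_keys:
        # index[key] yields suffix-key values in index order (ascending),
        # so the first `limit` entries are the smallest for that key.
        fetched.extend(index.get(key, [])[:limit])
    # Top-N heapsort over the bounded set, as the Sort node would do.
    return heapq.nsmallest(limit, fetched)

# Toy "index": bar_fk -> created_at values already in index order.
index = {1: [10, 40, 70], 2: [20, 50, 80], 3: [30, 60, 90]}
print(limited_fetch_then_sort(index, [1, 2, 3], 2))  # -> [10, 20]
```

Here 6 tuples (3 keys x limit 2) are fetched instead of all 9, but a sort is still required afterward.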

Pros:

- Presumably simpler to implement.
- Ought to be faster (than master) in virtually all cases (or at least
the penalty in the worst case is extremely small).

Cons:

- Doesn't capture all of the theoretical value. For example, for an
array of size 100, a limit of 25, and 100 matching values per array
key, the current code fetches 10,000 tuples, this proposal would fetch
2,500, and the best case is 25. In this sample scenario we fetch 25%
of the original tuples, but still 100x the theoretical minimum required.
- Index scan node has to learn about limit (this feels pretty dirty).
- Still needs a sort node.

Proposal 2:

General idea: find the "first" (or last, depending on sort order)
index tuple for each array op element. Sort those index tuples by the
suffix keys. Return tuples starting from the lowest-sorting of these
positions until we reach one that sorts past the next position's index
tuple. Continue this round-robin approach until we exhaust all tuples.
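The round-robin merge can be sketched in Python like so (again a hypothetical illustration with toy data, not patch code): one cursor per array element, always emitting from the cursor whose current suffix-key value sorts lowest, and advancing only that cursor.

```python
import heapq

def merged_limited_scan(index, array_keys, limit):
    """Sketch of proposal 2 (hypothetical, not PostgreSQL code). Returns
    (results, tuples_fetched) to show how few tuples get touched."""
    iters = [iter(index.get(key, [])) for key in array_keys]
    heap, fetched = [], 0
    for i, it in enumerate(iters):
        first = next(it, None)          # position each cursor on its first tuple
        if first is not None:
            heap.append((first, i))
            fetched += 1
    heapq.heapify(heap)
    results = []
    while heap and len(results) < limit:
        value, i = heapq.heappop(heap)  # globally lowest current tuple
        results.append(value)
        nxt = next(iters[i], None)      # round-robin: advance only this cursor
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
            fetched += 1
    return results, fetched

# Toy "index": bar_fk -> created_at values in index order.
index = {1: [10, 40, 70], 2: [20, 50, 80], 3: [30, 60, 90]}
print(merged_limited_scan(index, [1, 2, 3], 2))  # -> ([10, 20], 5)
```

In this example only 5 tuples are fetched to produce 2 results, versus 6 under proposal 1's bound and all 9 under current behavior, and no separate sort step is needed.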

Pros:

- Best case is significantly improved over proposal 1.
- Always fetches the minimal number of tuples required by the limit.
- When ordered values are not evenly distributed this should be even
faster (we just continue to pull tuples for the array value with the
lowest order value).
- Index scan node remains unaware of limit.
- Doesn't need a sort node.

Cons:

- Presumably more difficult to implement.
- Degenerate cases: when ordered values are evenly distributed we may
have to re-search from the top of the tree for each tuple (so the worst
case is when few tuples match within each prefix).

---

Questions:

- Do we need a new index access method for this? Or do we shoehorn it
into the existing index scan? (Perhaps the answer depends on which
strategy is chosen.)
- Similarly do we want a new scan node type? (This brings up the same
kinds of questions in the skip scan discussion about multiplying node
types.)
- Is holding a pin on multiple index pages acceptable?
- Or do we avoid that at the cost of more frequent searches from the
top of the tree?
- If we have to search from the top of the tree, then does the "many
duplicates" case become degenerate (to put it another way, how do we
search to proper next tuple in a non-unique index)?

Status:
I've begun work on proposal 2 as I believe it shows the most potential
gain (though it also has higher complexity). I don't have a patch in a
state worth showing yet, but I wanted to start the discussion on the
design considerations while I continue work on the patch.

- James Coleman
