Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT

From: James Coleman <jtc331(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using Btree to Provide Sorting on Suffix Keys with LIMIT
Date: 2018-12-30 14:29:43
Message-ID: CAAaqYe-Ggc34kDXeDWm+fiQnhzYFZrFk5aBe5eyWzY0Y3urAXQ@mail.gmail.com

On Sat, Dec 29, 2018 at 9:50 PM David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>
> On Sun, 30 Dec 2018 at 13:00, James Coleman <jtc331(at)gmail(dot)com> wrote:
> > Note that the index scan (or bitmap scan) nodes return all 1500 rows
> > matching `bar_fk IN (1,2,3)`. After all rows are returned, that total
> > set is ordered, and finally the LIMIT is applied. While only 50 rows
> > were requested, 30x that were fetched from the heap.
> >
> > I believe it is possible to use the index
> > `index_foos_on_bar_fk_and_created_at` to fulfill both the `bar_fk IN
> > (1,2,3)` qualifier and (at least partially; more on that later) the
> > ordering `ORDER BY created_at` while fetching fewer than all rows
> > matching the qualifier.
> >
> > Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> > (...) order by c` and `a in (...) and b = 1 order by c` (and further
> > similar derivations with increasing numbers of equality quals).
>
> I don't quite understand the above paragraph, but I assume this
> would be possible to do with some new index am routine which allowed
> multiple different values for the initial key. Probably execution for
> something like this could be handled by having 1 IndexScanDesc per
> initial key. A scan would have to scan all of those for the first
> tuple and return the lowest order key. Subsequent scans would fetch
> the next tuple for the IndexScanDesc used previously then again,
> return the lowest order tuple. There's some binary heap code and
> examples in nodeMergeAppend.c about how that could be done fairly
> efficiently.

Mostly I was pointing out that the simple case (scalar array op qual
on first index column and order by second index column) isn't the only
potentially interesting one; we'd also want to handle, for example, a
3-column index with an equality qual on the first, an array op on the
second, and order by the third. And then as you note later we could
also theoretically do this for multiple array op quals.
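
To make that concrete, a hypothetical shape (table, column, and index
names are made up purely for illustration) would be:

CREATE TABLE t (a int, b int, c timestamptz);
CREATE INDEX t_a_b_c_idx ON t (a, b, c);

SELECT *
FROM t
WHERE a = 1
  AND b IN (1, 2, 3)
ORDER BY c
LIMIT 50;

As with the two-column foos example upthread, today this gets planned
as a scan that fetches every row matching the quals and then sorts,
even though the (a, b, c) index could in principle hand back the top
50 rows in order by merging the three (a = 1, b = ...) groupings.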

Thanks for the pointer to nodeMergeAppend.c; I'll look at that to see
if it sparks any ideas.

I'm intrigued by the idea of having multiple IndexScanDesc in the
node. My current route had been to include an array of BTScanPos in
BTScanOpaqueData and work within the same scan. Do you think that a
new index am targeting multiple initial values would be a better route
than improving the existing native array handling in
nbtree.c/nbtutils.c? It seems to me that fitting it into the existing
code gives us greater potential usefulness, but with more effort needed
to preserve the existing efficiencies there.

It sounds like holding multiple pinned pages for the same index in the
same node (if we go with multiple IndexScanDescs) is acceptable? We
should still not be locking multiple pages at once, so I don't think
there's risk of deadlock, but I wasn't sure if there were specific
expectations about the number of pinned pages in a single relation at
a given time.

> The hard part about that would be knowing when to draw the line. If
> there were 1000s of initial keys then some other method might be
> better. Probably the planner would have to estimate which method was
> best. There are also issues if there are multiple prefix keys as you'd
> need to scan a cartesian product of the keys, which would likely get
> out of hand quickly with 2 or more initial keys.

Agreed. I expect the costing here to report both a higher startup cost
(since it has to look at one index tuple per array element up front)
and a higher per-tuple cost (since we might have to re-search), but if
there is a very large number of matching rows (e.g., millions) and a
small LIMIT, then it's hard to imagine this not being the better option
even up to a large number of keys (though at some point memory becomes
a concern as well).
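
As a back-of-the-envelope illustration using the numbers from the plan
upthread (1500 rows matching `bar_fk IN (1,2,3)`, 3 array keys, LIMIT
50):

  current plan:    ~1500 index tuples read, ~1500 heap fetches, then a
                   sort of all 1500 rows
  merge approach:  3 index descents, roughly 50 + 3 index tuples read
                   (the 50 returned plus at most one pending tuple per
                   key), ~50 heap fetches, no sort

So even if each returned tuple costs a bit more, the total work should
be much smaller whenever the LIMIT is far below the number of matching
rows.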

> There were discussions and a patch for a planner-level implementation
> which could likely assist with this in [1]. I'm not sure if this
> particular case was handled in the patch. I believe it was more
> intended for queries such as: SELECT ... FROM t WHERE a = 1 OR b = 2
> and could transform this into something more along the lines of:
> SELECT ... FROM t WHERE a = 1 UNION SELECT ... FROM t WHERE b = 2, and
> using the table's ctid to uniquify the rows. You could get away with
> something similar but use UNION ALL instead. You don't need UNION
> since your "OR" is on the same column, meaning there can be no
> overlapping rows.
>
> Something like:
>
> (SELECT * FROM foos WHERE bar_fk = 1 ORDER BY created_at LIMIT 50)
> UNION ALL
> (SELECT * FROM foos WHERE bar_fk = 2 ORDER BY created_at LIMIT 50)
> UNION ALL
> (SELECT * FROM foos WHERE bar_fk = 3 ORDER BY created_at LIMIT 50)
> ORDER BY created_at LIMIT 50;

This sounds effectively like a way to do my first proposal. In theory
I think both are valuable and potentially complementary, so I'll read
up on that one also.
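
For what it's worth, the same rewrite seems to extend to the
multi-column cases above; using the made-up (a, b, c) table from
earlier, something like:

(SELECT * FROM t WHERE a = 1 AND b = 1 ORDER BY c LIMIT 50)
UNION ALL
(SELECT * FROM t WHERE a = 1 AND b = 2 ORDER BY c LIMIT 50)
UNION ALL
(SELECT * FROM t WHERE a = 1 AND b = 3 ORDER BY c LIMIT 50)
ORDER BY c LIMIT 50;

Each branch can be an ordered scan of the (a, b, c) index, and with
multiple array quals the number of branches would become the cartesian
product you mention.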

> > Note: Another (loosely) related problem is that sorting can't
> > currently take advantage of cases where an index provides a partial
> > (prefix of requested pathkeys) ordering.
>
> There has been a patch [2] around for about 4 years now that does
> this. I'm unsure of the current status, other than not yet committed.

Doh, I should have linked to that; I've been following the incremental
sort patch for a while (and submitted a test-case review) since it
solves some significant problems for us.

- James Coleman
