Quick Links

Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant

From:	Richard Guo <guofenglinux(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant
Date:	2022-10-13 08:17:16
Message-ID:	CAMbWs4-ic4i0=6T47UcrYYNFO64rQQ2TDcTzN41nK+SqA+Frrw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Oct 13, 2022 at 2:48 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Thu, 13 Oct 2022 at 16:47, Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> > Currently in the patch the optimization is done before we check for
> > presorted paths or do the explicit sort of the cheapest path. How about
> > we move this optimization into the branch where we've found any
> > presorted paths? Maybe something like:
>
> I've attached a patch to that effect, but it turns out a bit more
> complex than what you imagined. We still need to handle the case for
> when there's no path that has the required pathkeys and we must add a
> SortPath to the cheapest path. That requires adding some similar code
> to add the LimitPath after the foreach loop over the pathlist is over.

Thanks for the new patch. Previously I considered we just apply this
optimization for adequately-presorted paths so that we can just fetch
the first row from that path. But yes we can also do this optimization
for explicit-sort case so that we can get the result from a top-1
heapsort, just like the new patch does.

I was also getting some weird plans like:
>
> regression=# explain select distinct on (four) four,hundred from tenk1
> where four=0 order by 1,2;
> QUERY PLAN
> ----------------------------------------------------------------------
> Sort (cost=0.20..0.20 rows=1 width=8)
> Sort Key: hundred
> -> Limit (cost=0.00..0.19 rows=1 width=8)
> -> Seq Scan on tenk1 (cost=0.00..470.00 rows=2500 width=8)
> Filter: (four = 0)
>
> To stop the planner from adding that final sort, I opted to hack the
> LimitPath's pathkeys to say that it's already sorted by the
> PlannerInfo's sort_pathkeys. That feels slightly icky, but it does
> seem a little wasteful to initialise a sort node on every execution of
> the plan to sort a single tuple.

I don't get how this plan comes out. It seems not correct because Limit
node above an unsorted path would give us an unpredictable row. I tried
this query without the hack to LimitPath's pathkeys and I get plans
below, with or with index scan:

explain (costs off) select distinct on (four) four,hundred from tenk1
where four=0 order by 1,2;
QUERY PLAN
-----------------------------------------------------
Result
-> Limit
-> Index Scan using tenk1_hundred on tenk1
Filter: (four = 0)

explain (costs off) select distinct on (four) four,hundred from tenk1
where four=0 order by 1,2;
QUERY PLAN
----------------------------------
Limit
-> Sort
Sort Key: hundred
-> Seq Scan on tenk1
Filter: (four = 0)

Thanks
Richard

In response to

Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant at 2022-10-13 06:48:31 from David Rowley

Responses

Re: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant at 2022-10-13 10:43:23 from David Rowley

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2022-10-13 08:33:01	libpq support for NegotiateProtocolVersion
Previous Message	Amit Langote	2022-10-13 08:14:30	Re: ExecRTCheckPerms() and many prunable partitions