Re: FETCH FIRST clause PERCENT option

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: surafel3000(at)gmail(dot)com, vik(dot)fearing(at)2ndquadrant(dot)com, hornschnorter(at)gmail(dot)com, andres(at)anarazel(dot)de, pgsql-hackers(at)postgresql(dot)org, andrew(at)tao11(dot)riddles(dot)org(dot)uk
Subject: Re: FETCH FIRST clause PERCENT option
Date: 2019-03-01 16:18:04
Message-ID: 322d0c3f-8ea6-06b5-c0c4-419dd35beb3f@2ndquadrant.com
Lists: pgsql-hackers


On 3/1/19 2:31 AM, Kyotaro HORIGUCHI wrote:
> Hello.
>
> At Thu, 28 Feb 2019 21:16:25 +0100, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <fbd08ad3-5dd8-3169-6cba-38d610d7be7f(at)2ndquadrant(dot)com>
>>> One biggest issue seems to be we don't know the total number of
>
> # One *of* the biggest *issues*?
>
>>> outer tuples before actually reading a null tuple. I doubt of
>>> general shortcut for that. It also seems preventing limit node
>>> from just using materialized outer.
>>>
>>
>> Sure, if you actually want all tuples, you'll have to execute the outer
>> plan till completion. But that's not what I'm talking about - what if we
>> only ever need to read one row from the limit?
>
> We have no choice than once reading all tuples just to find we
> are to return just one row, since estimator is not guaranteed to
> be exact as required for this purpose.
>

I don't follow - I'm not suggesting we base this on row estimates at
all. I'm saying that we can read rows incrementally and decide, as we
go, how many rows we are expected to produce.

For example, with a 1% sample we'll produce about 1 row for every 100
rows read from the outer plan. So we read the first row and compute

rows_to_produce = ceil(0.01 * rows_read)

and every time this increases, we emit another row from a tuplestore.
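
To illustrate, here's a minimal standalone sketch of that rule (plain C,
not the actual patch or nodeLimit.c code - read_next_input_row() and the
hard-coded 1000-row input are just stand-ins for pulling tuples from the
outer plan):

    #include <math.h>
    #include <stdio.h>

    static long rows_read = 0;           /* input rows pulled so far */
    static long rows_emitted = 0;        /* output rows produced so far */
    static const double fraction = 0.01; /* FETCH FIRST 1 PERCENT */

    /* Stand-in for pulling the next tuple from the outer plan;
     * returns 0 once the (pretend) 1000-row input is exhausted. */
    static int read_next_input_row(void)
    {
        return rows_read < 1000;
    }

    /* Produce the next output row, reading only as much input as needed:
     * pull input rows until ceil(fraction * rows_read) exceeds the number
     * of rows already emitted, then emit one row (from the tuplestore,
     * in the real executor).  Returns 0 when no more rows can be made. */
    static int next_output_row(void)
    {
        while (read_next_input_row())
        {
            rows_read++;
            if ((long) ceil(fraction * rows_read) > rows_emitted)
            {
                rows_emitted++;
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        /* A consumer that stops after the first row only forces a small
         * prefix of the input to be read, not the whole outer plan. */
        if (next_output_row())
            printf("first row emitted after reading %ld input rows\n",
                   rows_read);
        return 0;
    }

With a 1% fraction the first output row is available after a single input
row, the next one after another 100 input rows, and so on - so a consumer
that stops pulling from the limit node early never forces the outer plan
to run to completion.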

>> To give you a (admittedly, somewhat contrived and artificial example):
>>
>> SELECT * FROM t1 WHERE id IN (
>> SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
>> );
>>
>> Maybe this example is bogus and/or does not really matter in practice. I
>> don't know, but I've been unable to convince myself that's the case.
>
> I see such kind of idiom common. Even in the quite simple example
> above, *we* cannot tell how many tuples the inner should return
> unless we actually fetch all tuples in t2. This is the same
> problem with count(*).
>
> The query is equivalent to the following one.
>
> SELECT * FROM t1 WHERE id IN (
> SELECT id FROM t2 ORDER BY x
> FETCH FIRST (SELECT ceil(count(*) * 0.1) FROM t2) ROWS ONLY
> );
>
> This scans t2 twice, but this patch does only one full scan
> moving another partial scan to tuplestore. We would win if the
> outer is complex enough.
>

But what if all t1 rows match the very first row in the subselect? In
that case we don't need to read all rows from the subplan - we only need
to read the first chunk (until we emit the first row).

> Anyway, even excluding the count(*) issue, it seems that we are
> not alone in that point. (That is, at least Oracle shows
> different E-Rows and A-Rows for PERCENT).
>

I'm not sure what E-Rows and A-Rows are - I suppose that's estimated and
actual row counts. But as I explained above, that's not what I propose
the incremental approach should rely on.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
