From: Surafel Temesgen <surafel3000(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: vik(dot)fearing(at)2ndquadrant(dot)com, Mark Dilger <hornschnorter(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, andrew(at)tao11(dot)riddles(dot)org(dot)uk
Subject: Re: FETCH FIRST clause PERCENT option
Date: 2019-01-03 12:00:29
Message-ID: CALAY4q_4DwEJMp=474C8a1vRAtx9S=iW7g44MRQxX1J=mPg3aw@mail.gmail.com
Lists: pgsql-hackers
Hi,
On Tue, Jan 1, 2019 at 10:08 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:
> Hi,
>
> I've been looking at this patch today, and I think there's a couple of
> issues with the costing and execution. Consider a simple table with 1M
> rows:
>
> create table t as select i from generate_series(1,1000000) s(i);
>
> and these two queries, that both select 1% of rows
>
> select * from t fetch first 10000 rows only;
>
> select * from t fetch first 1 percent rows only;
>
> which are costed like this
>
> test=# explain select * from t fetch first 10000 rows only;
> QUERY PLAN
> -----------------------------------------------------------------
> Limit (cost=0.00..144.25 rows=10000 width=4)
> -> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
> (2 rows)
>
> test=# explain select * from t fetch first 1 percent rows only;
> QUERY PLAN
> -----------------------------------------------------------------
> Limit (cost=0.00..14425.00 rows=1000000 width=4)
> -> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
> (2 rows)
>
> There are two issues with the costs in the "percent" case.
>
> Firstly, the estimated number of rows should be about the same (10k) in
> both cases, but the "percent" case incorrectly uses the total row count
> (i.e. as if 100% rows were returned).
>
Ok, I will correct it.
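For reference, the intended estimate is just a fraction of the input row estimate; a minimal sketch of that arithmetic (the function name and the clamp to a minimum of 1 are mine, not the patch's):

```python
def estimate_percent_limit_rows(input_rows, percent):
    """Estimated output rows for FETCH FIRST <percent> PERCENT:
    a fraction of the input row estimate, clamped to at least 1."""
    return max(1, int(input_rows * percent / 100.0))

# For the example above: 1 percent of 1M rows should estimate ~10000,
# not the full 1000000 the current patch reports.
```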
> It's correct that the total cost for the "percent" case is based on 100%
> of rows, of course, because the code has to fetch everything and stash
> it into the tuplestore in LIMIT_INITIAL phase.
>
> But that implies the startup cost for the "percent" case can't be 0,
> because all this work has to be done before emitting the first row.
>
>
Correct, the startup cost must equal the total cost of the source data path,
and the total cost has to be slightly greater than the startup cost. I am
planning to increase the total cost by limitCount * 0.1 (similar to
parallel_tuple_cost), because getting a tuple from the tuplestore is an
operation almost similar to passing a tuple from a worker to the master
backend.
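To spell the proposal out on the 1M-row example (a sketch only; the function
and constant names are mine, with the 0.1 per-tuple factor taken from
parallel_tuple_cost as described above):

```python
# Assumed per-tuple cost of reading back from the tuplestore,
# mirroring parallel_tuple_cost's default of 0.1.
TUPLESTORE_TUPLE_COST = 0.1

def cost_percent_limit(input_total, input_rows, percent):
    """Sketch of costing for FETCH FIRST <percent> PERCENT ROWS ONLY.
    Startup cost = the input path's total cost, since every input row
    must be stashed in the tuplestore before the first row is emitted;
    total cost adds a per-tuple charge for reading rows back out."""
    rows = max(1, int(input_rows * percent / 100.0))
    startup = input_total
    total = startup + rows * TUPLESTORE_TUPLE_COST
    return startup, total, rows

# Seq scan on t: total cost 14425.00 over 1000000 rows, fetching 1 percent.
s, t, r = cost_percent_limit(14425.00, 1000000, 1)
# r = 10000 rows, startup = 14425.00, total = 14425.00 + 10000 * 0.1
```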
> So these costing aspects need fixing, IMHO.
>
>
> The execution part of the patch seems to be working correctly, but I
> think there's an improvement - we don't need to execute the outer plan
> to completion before emitting the first row. For example, let's say the
> outer plan produces 10000 rows in total and we're supposed to return the
> first 1% of those rows. We can emit the first row after fetching the
> first 100 rows, we don't have to wait for fetching all 10k rows.
>
>
But the total row count is not known in advance, so how can we determine
that it is safe to return a row?
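If I understand the quoted suggestion, the point is that the safe output
count is monotone in the number of rows fetched so far, so no total is
needed up front. A sketch (names mine; whether to use floor or ceiling here
is a separate question, but both are monotone in the fetched count):

```python
def safe_to_emit(fetched_so_far, emitted_so_far, percent):
    """How many more rows may be emitted after fetching n input rows,
    without knowing the final input count. The final output size is
    floor(final_n * percent / 100), and since final_n >= n, emitting
    floor(n * percent / 100) rows can never overshoot."""
    safe_total = fetched_so_far * percent // 100
    return safe_total - emitted_so_far

# Tomas's example: returning 1 percent, the first row becomes safe
# once 100 input rows have been fetched.
```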
Regards
Surafel