Re: FETCH FIRST clause PERCENT option

From: Ryan Lambert <ryan(at)rustprooflabs(dot)com>
To: Surafel Temesgen <surafel3000(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Vik Fearing <vik(dot)fearing(at)2ndquadrant(dot)com>, Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Subject: Re: FETCH FIRST clause PERCENT option
Date: 2019-07-08 22:27:22
Message-ID: CAN-V+g-KwCFhBJoFGyoqBGE+A-+4fvgF23SPEiX62UPVH9+grA@mail.gmail.com
Lists: pgsql-hackers

Hi Surafel,

The v5 patch [1] applies cleanly and passes make installcheck-world.

My concern with this patch is its performance. As a user I would expect this
feature to make queries faster than simply pulling the full table. I tested
on tables ranging from 10k rows up to 10 million with the same basic result:
using FETCH FIRST N PERCENT is slower than querying the full table. At best
it ran slightly slower than the full-table query; at worst it increased
execution time by roughly 1400% when fetching a high percentage (95%).

Testing was on a DigitalOcean 2 CPU, 2GB RAM droplet. All queries were
executed multiple times (6+) with EXPLAIN (ANALYZE, COSTS) and psql's \timing
enabled; the query plans provided are from the faster end of each query's range.
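
Each run looked roughly like the sketch below; the exact session setup is my
assumption, and \timing is a psql meta-command rather than SQL:

\timing on
EXPLAIN (ANALYZE, COSTS) SELECT count(*) FROM r10mwide;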

Create the test table:

DROP TABLE IF EXISTS r10mwide;
CREATE TABLE r10mwide AS
SELECT generate_series(1, 10000000)::INT AS id,
       random() AS v1,
       random() AS v2,
       random() AS v3,
       random() AS v4,
       random() AS v5,
       random() AS v6,
       random() AS v7,
       random() AS v8,
       random() AS v9,
       random() AS v10
;
VACUUM ANALYZE;
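
A quick sanity check on the generated table (my addition, not part of the
original test script; pg_size_pretty() and pg_total_relation_size() are
standard admin functions):

SELECT count(*) AS row_count,
       pg_size_pretty(pg_total_relation_size('r10mwide')) AS total_size
FROM r10mwide;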

Start with a baseline query against the full table; the limiting will be done
within the CTE, as I would typically do in a production query. The outer
query does basic aggregates. Below I provide each full query for easy
copy/paste, but the only real change between them is how the inner results
are limited. This one runs in 2.2s off the full table.

EXPLAIN (ANALYZE, COSTS)
WITH t AS (
    SELECT id, v1, v2
    FROM r10mwide
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;

                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=227192.07..227192.08 rows=1 width=24) (actual time=2230.419..2230.419 rows=1 loops=1)
   ->  Gather  (cost=227191.84..227192.05 rows=2 width=72) (actual time=2228.321..2231.225 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=226191.84..226191.85 rows=1 width=72) (actual time=2218.754..2218.754 rows=1 loops=3)
               ->  Parallel Seq Scan on r10mwide  (cost=0.00..184524.92 rows=4166692 width=16) (actual time=0.032..934.652 rows=3333333 loops=3)
 Planning Time: 0.116 ms
 Execution Time: 2231.935 ms
(8 rows)

Time: 2232.459 ms (00:02.232)

That plan used parallel workers; since the FETCH FIRST queries apparently
won't, I disabled parallelism and ran the baseline again.

SET max_parallel_workers_per_gather = 0;

The new plan for the full table, without parallelism, runs in just over 4 seconds.

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=342859.21..342859.22 rows=1 width=24) (actual time=4253.861..4253.862 rows=1 loops=1)
   ->  Seq Scan on r10mwide  (cost=0.00..242858.60 rows=10000060 width=16) (actual time=0.062..1728.346 rows=10000000 loops=1)
 Planning Time: 0.082 ms
 Execution Time: 4253.918 ms
(4 rows)

The following uses an explicit row count to pull 100k rows (1%) and gives a
massive improvement in speed; this query ranged from 55-120 ms. This is the
type of speedup I would expect, and it beats the full-table query hands down,
regardless of parallel query.

EXPLAIN (ANALYZE, COSTS)
WITH t AS (
    SELECT id, v1, v2
    FROM r10mwide
    FETCH FIRST 100000 ROWS ONLY
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4428.58..4428.59 rows=1 width=24) (actual time=55.963..55.963 rows=1 loops=1)
   ->  Limit  (cost=0.00..2428.57 rows=100000 width=20) (actual time=0.041..35.725 rows=100000 loops=1)
         ->  Seq Scan on r10mwide  (cost=0.00..242858.60 rows=10000060 width=20) (actual time=0.039..24.796 rows=100000 loops=1)
 Planning Time: 0.134 ms
 Execution Time: 56.032 ms
(5 rows)

Time: 57.262 ms

Using FETCH FIRST 1 PERCENT takes over 7 seconds, 71% slower than querying
the full table. The plan below shows why: the seq scan under the Limit node
still returns all 10 million rows, so the percentage apparently cannot be
applied without reading the entire input.

EXPLAIN (ANALYZE, COSTS)
WITH t AS (
    SELECT id, v1, v2
    FROM r10mwide
    FETCH FIRST 1 PERCENT ROWS ONLY
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6857.22..6857.23 rows=1 width=24) (actual time=7112.205..7112.206 rows=1 loops=1)
   ->  Limit  (cost=2428.60..4857.19 rows=100001 width=20) (actual time=0.036..7047.394 rows=100000 loops=1)
         ->  Seq Scan on r10mwide  (cost=0.00..242858.60 rows=10000060 width=20) (actual time=0.033..3072.881 rows=10000000 loops=1)
 Planning Time: 0.132 ms
 Execution Time: 7214.302 ms
(5 rows)

Time: 7215.278 ms (00:07.215)

Cranking the percentage up to 95% took 59-75 seconds.

EXPLAIN (ANALYZE, COSTS)
WITH t AS (
    SELECT id, v1, v2
    FROM r10mwide
    FETCH FIRST 95 PERCENT ROWS ONLY
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=651432.48..651432.49 rows=1 width=24) (actual time=58981.043..58981.044 rows=1 loops=1)
   ->  Limit  (cost=230715.67..461431.34 rows=9500057 width=20) (actual time=0.017..55799.389 rows=9500000 loops=1)
         ->  Seq Scan on r10mwide  (cost=0.00..242858.60 rows=10000060 width=20) (actual time=0.014..3847.146 rows=10000000 loops=1)
 Planning Time: 0.117 ms
 Execution Time: 59079.680 ms
(5 rows)

Even taking it way down to .001 percent (100 rows!) didn't run fast by any
measure; it still took more than 6 seconds.

EXPLAIN (ANALYZE, COSTS)
WITH t AS (
    SELECT id, v1, v2
    FROM r10mwide
    FETCH FIRST .001 PERCENT ROWS ONLY
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6.86..6.87 rows=1 width=24) (actual time=6414.406..6414.406 rows=1 loops=1)
   ->  Limit  (cost=2.43..4.86 rows=100 width=20) (actual time=0.021..6413.981 rows=100 loops=1)
         ->  Seq Scan on r10mwide  (cost=0.00..242858.60 rows=10000060 width=20) (actual time=0.017..3086.504 rows=10000000 loops=1)
 Planning Time: 0.168 ms
 Execution Time: 6495.686 ms
(5 rows)
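
For comparison, a workaround sketch (my suggestion, not part of the patch
under review): derive an explicit LIMIT from the planner's row estimate.
reltuples in pg_class is only an estimate kept current by VACUUM/ANALYZE,
but LIMIT accepts a scalar subquery and lets the scan stop early, as the
100k-row test above did.

EXPLAIN (ANALYZE, COSTS)
WITH t AS (
    SELECT id, v1, v2
    FROM r10mwide
    -- approximate 1 PERCENT via the statistics-based row estimate
    LIMIT (SELECT (reltuples * 0.01)::bigint
           FROM pg_class WHERE relname = 'r10mwide')
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;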

[1] https://www.postgresql.org/message-id/attachment/102333/percent-incremental-v5.patch

*Ryan Lambert*
RustProof Labs

On Mon, Jul 8, 2019 at 1:09 AM Surafel Temesgen <surafel3000(at)gmail(dot)com>
wrote:

> Hi Thomas,
> Thank you for informing me
>
> Hi Surafel,
>>
>> There's a call to adjust_limit_rows_costs() hiding under
>> contrib/postgres_fdw, so this fails check-world.
>>
>
> Fixed. I also included the review given by Ryan in the attached patch
>
> regards
> Surafel
>
