Re: BUG #16869: GROUP BY on primary key unnecessarily triggers a full table scan

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: carroll(at)clwainwright(dot)net, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16869: GROUP BY on primary key unnecessarily triggers a full table scan
Date: 2021-02-19 08:55:57
Message-ID: CAMbWs49X=vuexTVEoGVqtcuVETzVbzPkqRF32ca6LWB25XU73Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On Tue, Feb 16, 2021 at 5:55 PM PG Bug reporting form <
noreply(at)postgresql(dot)org> wrote:

>
> Then the following query causes a full table scan (about 100ms on my
> computer):
> ```
> SELECT id, name, data FROM foo GROUP BY id ORDER BY data LIMIT 10;
> ```
> The EXPLAIN ANALYZE output is:
> ```
> Limit (cost=8755.26..8755.28 rows=10 width=226) (actual
> time=103.486..103.495 rows=10 loops=1)
> -> Sort (cost=8755.26..9005.26 rows=100000 width=226) (actual
> time=103.484..103.489 rows=10 loops=1)
> Sort Key: data
> Sort Method: top-N heapsort Memory: 26kB
> -> Group (cost=0.29..6594.29 rows=100000 width=226) (actual
> time=0.037..66.625 rows=100000 loops=1)
> Group Key: id
> -> Index Scan using foo_pkey on foo (cost=0.29..6344.29
> rows=100000 width=226) (actual time=0.033..31.239 rows=100000 loops=1)
> Planning Time: 0.158 ms
> Execution Time: 103.589 ms
> ```
>

For this query, when building scan paths for 'foo', the index scan path
based on 'foo_idx' is not generated from the beginning, because its
pathkey (sorted by 'data') is considered not usefull regarding
query_pathkeys (which represents the sort order on 'id' here).

However, even if we have the index scan path on 'foo_idx', we are still
unable to have incremental sort based on it, because it does not have
any presorted keys compared to group_pathkeys.

> Note the index scan on "foo_pkey", which should not be necessary.
> Obviously,
> the grouping doesn't do anything here, but it would if I had joined with
> some secondary table on which I wanted to aggregate (which is how I found
> this bug in the first place). If I drop the GROUP BY the query executes
> very
> quickly (< 1ms). Interestingly, the following queries *also* execute very
> quickly using the index:
> ```
> SELECT name, data FROM foo GROUP BY name, data ORDER BY data LIMIT 10;
> SELECT COALESCE(id) as foo_id, name, data FROM foo GROUP BY foo_id, name,
> data ORDER BY data LIMIT 10;
> ```
> This last query, which ought to be functionally equivalent to `GROUP BY id`
> yields the following execution plan:
> ```
> Limit (cost=67.83..69.38 rows=10 width=226) (actual time=0.139..0.158
> rows=10 loops=1)
> -> Group (cost=67.83..15636.35 rows=100000 width=226) (actual
> time=0.137..0.152 rows=10 loops=1)
> Group Key: data, (COALESCE(id)), name
> -> Incremental Sort (cost=67.83..14886.35 rows=100000 width=226)
> (actual time=0.134..0.137 rows=10 loops=1)
> Sort Key: data, (COALESCE(id)), name
> Presorted Key: data
> Full-sort Groups: 1 Sort Method: quicksort Average Memory:
> 27kB Peak Memory: 27kB
> -> Index Scan using foo_idx on foo (cost=0.29..6344.29
> rows=100000 width=226) (actual time=0.028..0.093 rows=33 loops=1)
> Planning Time: 0.212 ms
> Execution Time: 0.205 ms
> ```
> Note that it's properly using the "foo_idx" index here, and executes very
> fast.
>

Yes, this time the group_pathkeys (as well as query_pathkeys) are the
sort order on 'data, foo_id, name'. As a result we can leverage
'foo_idx' and incremental sort to avoid the full table scan, since there
is LIMIT clause here.

>
> So something is going on when grouping by the primary key. This seems like
> a
> bug, or, at the very least, very unintuitive behavior.
>

Agree.

Thanks
Richard

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2021-02-19 10:00:38 BUG #16874: Postgres Server crashes at commit
Previous Message Vik Fearing 2021-02-19 08:29:32 Re: BUG #16867: savepoints vs. commit and chain