Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Prajna Shetty <Prajna(dot)Shetty(at)mindtree(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Cc: Beenu Sharma <Beenu(dot)Sharma(at)mindtree(dot)com>, Pranabesh Saha <Pranabesh(dot)Saha(at)mindtree(dot)com>, Surya Susai <Surya(dot)Susai(at)mindtree(dot)com>
Subject: Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
Date: 2022-03-24 11:55:41
Message-ID: 6e31a8c8-f74b-e4b3-f5d1-dce68e68abbf@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-performance

On 3/22/22 13:57, Prajna Shetty wrote:
> ++ _pgsql-performance(at)postgresql(dot)org_
> <https://www.postgresql.org/list/pgsql-performance/>
>  
> Hello Team,
>  
> There is change in query plan in 12.4 version and Version 13 resulting
> in performance slowness post upgrade.
>  
>
> * In 12.4 version, Sort Operation Group Aggregate is selected which
> results to Merge Join. Query takes ~5 seconds.
> * In 13.5 version, optimizer wrongly estimates and due to new Disk
> Based Hash Aggregate feature, it prefers Hash Aggregate instead of
> Sort Operation which finally blocks merge-join and chooses Nested
> Loop Left Join. Query takes ~5 minutes.
>
>  
> *_NOTE: _*Disabling Hash Aggregate on instance level forces optimizer to
> choose merge operation but such instance level modification is not
> possible in terms of Application Functionality.
>  
> This performance issue is on all over most of queries. Attached one of
> the query and its plan in both version for reference in case that helps
> for recreating the issue.
>  

It's impossible to comment those other queries, but chances are the root
cause is the same.

> Version 13 query plan has lower estimated cost than that of 12.4 which
> implies 13.5 planner thought it found a better plan, but it is running
> slower and actual cost show more.
>  
> 12.4 Version:
> "Merge Right Join  (cost=*202198.78..295729.10* rows=1 width=8) (actual
> time=1399.727..*5224.574* rows=296 loops=1)"
>  
> 13.5 version:-
> "Nested Loop Left Join  (cost=*196360.90..287890.45* rows=1 width=8)
> (actual time=3209.577..*371300.693* rows=296 loops=1)"
>  

This is not a costing issue, the problem is that we expect 1 row and
calculate the cost for that, but then get 296. And unfortunately a
nested loop degrades much faster than a merge join.

I'm not sure why exactly 12.4 picked a merge join, chances are the
costing formular changed a bit somewhere. But as I said, the problem is
in bogus row cardinality estimates - 12.4 is simply lucky.

The problem most likely stems from this part:

-> GroupAggregate (cost=0.43..85743.24 rows=1830 width=72) (actual
time=1.621..3452.034 rows=282179 loops=3)
Group Key: student_class_detail.aamc_id
Filter: (max((student_class_detail.class_level_cd)::text) = '4'::text)
Rows Removed by Filter: 76060
-> Index Scan using uk_student_class_detail_aamcid_classlevelcd on
student_class_detail (cost=0.43..74747.61 rows=1284079 width=6) (actual
time=1.570..2723.014 rows=1272390 loops=3)
Filter: (class_level_start_dt IS NOT NULL)
Rows Removed by Filter: 160402

The filter is bound to be misestimated, and the error then snowballs.
Try replacing this part with a temporary table (with pre-aggregated
results) - you can run analyze on it, etc. I'd bet that'll make the
issue go away.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Justin Pryzby 2022-03-24 12:24:14 Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
Previous Message Andrey Borodin 2022-03-24 09:05:24 Re: BUG #17444: ERROR: found xmin 215633 from before relfrozenxid 1280585

Browse pgsql-performance by date

  From Date Subject
Next Message Justin Pryzby 2022-03-24 12:24:14 Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
Previous Message Lars Aksel Opsahl 2022-03-24 09:39:59 Re: Using system tables directly takes many hours, using temp tables with no indexes takes a few seconds for geometry_columns view.