Re: Improved Cost Calculation for IndexOnlyScan

From: Hamid Akhtar <hamid(dot)akhtar(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improved Cost Calculation for IndexOnlyScan
Date: 2020-09-29 08:49:23
Message-ID: CANugjhsn+LCMdUxB6KsEiGSh9+0+9Aw8YDFW3HZAyhuv5ebj8g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 29, 2020 at 1:08 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 29/09/2020 10:06, Hamid Akhtar wrote:
> > In one of my earlier emails [1], I mentioned that there seems to be a
> > problem with how the cost for index only scans is being calculated.
> > [1]
> >
> https://www.postgresql.org/message-id/CANugjhsnh0OBMOYc7qKcC%2BZsVvAXCeF7QiidLuFvg6zmHy1C7A%40mail.gmail.com
> >
> > My concern is that there seems to be a larger disconnect between the
> > cost of an index only scan and its execution time. Having tested this
> > on 3 different systems (Docker, a laptop, and a server configured with
> > RAID 5 SSDs), at the threshold where the index only scan cost exceeds
> > that of a sequential scan, the index only scan is still around 30%
> > faster than the sequential scan.
>
> A 30% discrepancy doesn't sound too bad, to be honest. The exact
> threshold depends on so many factors.
>
> > My initial hunch was that perhaps we need a different approach to
> > costing index only scans. However, the solution seems somewhat simple.
> >
> > The cost_index function in costsize.c, in the case of an index only
> > scan, multiplies the number of pages fetched by a factor of
> > (1.0 - baserel->allvisfrac), which is then used to calculate
> > max_IO_cost and min_IO_cost.
> >
> > This is very similar to how the cost estimate functions for indexes
> > internally call the genericcostestimate function. That function
> > primarily gets the number of pages for the index and multiplies it by
> > the random page cost (spc_random_page_cost) to get the total disk
> > access cost.
> >
> > I believe that in the case of an index only scan, we should adjust
> > spc_random_page_cost in the context of baserel->allvisfrac so that it
> > charges random page costs only for the fraction of the relation that
> > actually needs to be read, and excludes the index page fetches.
>
> That doesn't sound right to me. The genericcostestimate() function
> calculates the number of *index* pages fetched. It makes no difference
> if it's an Index Scan or an Index Only Scan.
>
> genericcostestimate() could surely be made smarter. Currently, it
> multiplies the number of index pages fetched with random_page_cost, even
> though a freshly created index is mostly physically ordered by the keys.
> seq_page_cost with some fudge factor might be more appropriate, whether
> or not it's an Index Only Scan. Not sure what the exact formula should
> be, just replacing random_page_cost with seq_page_cost is surely not
> right either.
>
> - Heikki
>

So, not an outright replacement of random_page_cost here, but rather a
change with baserel->allvisfrac taken into consideration (as given below):
----
index_random_page_cost = Min(spc_seq_page_cost +
                             spc_random_page_cost * (1.0 - baserel->allvisfrac),
                             spc_random_page_cost);
----
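
To illustrate how this behaves, here is a small standalone sketch (not the
actual costsize.c code; it simply assumes the default GUC values of
seq_page_cost = 1.0 and random_page_cost = 4.0) that prints the adjusted
value for a few all-visible fractions:
----
#include <stdio.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

int
main(void)
{
    double      spc_seq_page_cost = 1.0;    /* default seq_page_cost */
    double      spc_random_page_cost = 4.0; /* default random_page_cost */
    double      allvisfrac;

    /* show the adjusted cost over a range of all-visible fractions */
    for (allvisfrac = 0.0; allvisfrac <= 1.0; allvisfrac += 0.25)
    {
        double      index_random_page_cost =
            Min(spc_seq_page_cost +
                spc_random_page_cost * (1.0 - allvisfrac),
                spc_random_page_cost);

        printf("allvisfrac = %.2f -> index_random_page_cost = %.2f\n",
               allvisfrac, index_random_page_cost);
    }

    return 0;
}
----
With the default costs this gives 4.00, 4.00, 3.00, 2.00 and 1.00 for
all-visible fractions of 0.00 through 1.00. In other words, the adjusted
value only drops below plain random_page_cost once more than a quarter of
the heap is all-visible, reaches seq_page_cost when the whole relation is
all-visible, and the Min() cap keeps it from ever exceeding
random_page_cost.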

Does this make sense?
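
For completeness, if the intent is to apply this where genericcostestimate
charges for index page fetches (as I read the discussion above), the shape
would be roughly the following. This is a hypothetical fragment for
illustration only, not the actual source:
----
/* hypothetical sketch -- names and structure are illustrative only */
index_random_page_cost = Min(spc_seq_page_cost +
                             spc_random_page_cost * (1.0 - baserel->allvisfrac),
                             spc_random_page_cost);

/* charge the adjusted cost per index page instead of plain random cost */
indexTotalCost = numIndexPages * index_random_page_cost;
----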

--
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950 EMAIL: mailto:hamid(dot)akhtar(at)highgo(dot)ca
SKYPE: engineeredvirus
