Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT
Date: 2026-03-10 22:08:35
Message-ID: 4s3vd2rjll4vgnhdnajrqwsnhnmnmtmjtwjngfsrzmjx3nrxvp@la2vohpjojnk
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2026-03-10 17:03:14 -0400, Peter Geoghegan wrote:
> On Tue, Feb 10, 2026 at 4:26 AM Jakub Wartak
> <jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
> > I just worry that if get_actual_variable_range() starts reporting much worse
> > estimates what do we do then? Well maybe, we'll have pg_plan_advice then,
> > or maybe the depth (time?) of search should be somehow bound to the column's
> > statistics_target too, but on second thought it looks like it should be more
> > property of the query itself (so maybe some GUC?)
>
> The purpose of get_actual_variable_range is to get the extremal value
> in the index -- a single value from a single tuple. If we can't manage
> that by reading only 2 or 3 pages, then I see no good reason to
> believe we can by reading many more pages.
>
> In other words, we're perfectly justified in making a soft expectation
> that it'll require very little work to get an extremal value. But once
> we notice that our soft assumption doesn't hold, all bets are off --
> we're then justified in giving up relatively quickly. Having several
> contiguous pages that are completely empty (or empty of
> non-LP_DEAD-marked tuples) at the extreme leftmost or rightmost en of
> the index is absolutely a pathological case. It strongly suggests a
> queue-like workload of the kind that my testing shows that
> VISITED_PAGES_LIMIT can't deal with sensibly.

You don't need a queuing workload to occasionally have a few pages of dead
tuples at one end of the value range. E.g. a small batch insert that rolls
back due to a unique conflict is enough. The insert case is also the one
where, without get_actual_variable_range(), we will often end up with
completely bogus estimates, as obviously the stored stats won't yet know about
newly inserted data.

> get_actual_variable_range exists to ascertain information about a
> particular index. To me, it makes perfect sense that a mechanism such
> as this would work in terms of costs paid on the index AM side. (The
> heap fetches matter a great deal too, of course, but it's reasonable
> to use index costs as a proxy for heap/table AM costs -- but it's not
> reasonable to use table/heap AM costs as a proxy for index AM costs.
> It doesn't work both ways.)

I am not convinced it's true either direction - tids from three leaf pages of
tids can point to a very small number of heap pages or hundreds of heap
pages. Why is the index AM cost a good proxy for the table? If anything it's
more sane the other way round (i.e. how it is today).

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2026-03-10 22:09:14 Re: Fix uninitialized xl_running_xacts padding
Previous Message Zsolt Parragi 2026-03-10 22:05:25 Re: [PATCH] Add pg_get_database_ddl() function to reconstruct CREATE DATABASE statement