Re: strange slow query - lost lot of time somewhere

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: strange slow query - lost lot of time somewhere
Date: 2022-05-03 01:43:40
Message-ID: CAKFQuwbWRnxh-=sJXSV1CZwE2o8Qas40+WODW5XHmjbEZz7jrw@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 2, 2022 at 4:02 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> David Rowley <dgrowleyml(at)gmail(dot)com> writes:
> > On Mon, 2 May 2022 at 21:00, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> wrote:
> >> I found a query that is significantly slower with more memory
>
> > If it was work_mem you increased, it seems strange that the plan would
> > switch over to using a Nested Loop / Memoize plan.
>
> Yeah, there's something unexplained there.
>
> I spent some time studying cost_memoize_rescan, and the only
> conclusions I arrived at were that the variable names are poorly
> chosen and the comments are unhelpful. For instance, one would
> think that est_entry_bytes is the expected size of one cache entry,
> but it seems to actually be the total space that would be occupied
> if the whole input relation were loaded into the cache.

> And the est_cache_entries computation seems nonsensical; if it does
> make sense, the comment sure doesn't illuminate why.

My take on this is that a cache entry is keyed by a parameterization, and
any given entry can have, at most, every input tuple saved into it (hence
the computation of tuples * per-tuple size). So the maximum number of hash
keys is the total available memory divided by the input relation size; that
upper bound is stored in est_cache_entries. If the number of unique
parameterizations expected (at worst, one per call) is less than this, we
use that value and never evict. If it is more, we use est_cache_entries and
plan to evict.
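
Spelling that reading out as a tiny standalone program (this is only a
paraphrase of how I understand it, not the actual costsize.c code, and the
input numbers, including the per-tuple width, are made up):

/* paraphrase of my reading of the computation, not the actual source */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double tuples = 10000.0;         /* rows produced by the input relation */
    double width = 100.0;            /* made-up bytes per tuple */
    double hash_mem_bytes = 4.0 * 1024 * 1024;  /* memory available to the cache */
    double ndistinct = 200.0;        /* expected unique parameterizations */

    /* the "entry" size is really the whole input relation */
    double est_entry_bytes = tuples * width;

    /* so the cap on hash keys is total memory / input relation size */
    double est_cache_entries = floor(hash_mem_bytes / est_entry_bytes);

    /* keep every parameterization if that fits, otherwise plan to evict */
    double est_entries = fmin(ndistinct, est_cache_entries);

    printf("est_entry_bytes = %.0f\n", est_entry_bytes);
    printf("est_cache_entries = %.0f\n", est_cache_entries);
    printf("est_entries = %.0f (evictions expected: %s)\n",
           est_entries, ndistinct > est_cache_entries ? "yes" : "no");
    return 0;
}

With those numbers est_cache_entries comes out at 4, even though the 4MB of
hash memory is four times the size of the whole 1MB input relation, which
is what the next point is about.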

What I'm expecting to find, but don't see, is that by definition each
unique parameterization must match its own, non-overlapping subset of the
input relation's tuples. If we remember only those tuples that matched,
then at no point should the total memory for the hash table exceed the size
of the input relation.
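
To make that concrete with the same made-up numbers as above: 10,000 input
tuples at 100 bytes each is roughly 1MB, and with 200 distinct parameter
values the 200 matched subsets can only add up to 10,000 tuples between
them. A cache that stores only matched tuples should therefore top out
around 1MB in total, comfortably under the 4MB of hash memory, regardless
of how the tuples are spread across entries.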

Now, I'm not completely confident the cache only holds matched tuples...but
if so:

From that, mpath->est_entries should be "min(hash_mem_bytes /
est_entry_bytes, 1.0) * ndistinct",
i.e., all groups, or a fractional subset based upon the available hash memory.

Then:

ndistinct = estimate_num_groups(), or at worst calls
retention_ratio = min(hash_mem_bytes / est_entry_bytes, 1.0)
est_entries = retention_ratio * ndistinct
evict_ratio = 1.0 - retention_ratio

hit_ratio = (est_entries / ndistinct) - (ndistinct / calls), clamped to 0.0
I don't understand the adjustment factor ndistinct / calls.
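
For what it's worth, the same kind of standalone sketch for the formulas
I'm proposing above (again with made-up inputs; this is not a patch against
costsize.c, just the arithmetic):

/* sketch of the alternative formulas above, not the existing code */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    double tuples = 10000.0;         /* rows produced by the input relation */
    double width = 100.0;            /* made-up bytes per tuple */
    double hash_mem_bytes = 4.0 * 1024 * 1024;
    double calls = 5000.0;           /* expected rescans */
    double ndistinct = 200.0;        /* estimate_num_groups(), or at worst calls */

    double est_entry_bytes = tuples * width;    /* whole input relation */
    double retention_ratio = fmin(hash_mem_bytes / est_entry_bytes, 1.0);
    double est_entries = retention_ratio * ndistinct;
    double evict_ratio = 1.0 - retention_ratio;
    double hit_ratio = fmax(est_entries / ndistinct - ndistinct / calls, 0.0);

    printf("retention_ratio = %.2f\n", retention_ratio);
    printf("est_entries = %.0f of %.0f groups\n", est_entries, ndistinct);
    printf("evict_ratio = %.2f, hit_ratio = %.2f\n", evict_ratio, hit_ratio);
    return 0;
}

With the same inputs as earlier, retention_ratio is 1.0, so we keep all 200
groups and expect no evictions, whereas my sketch of the current
computation capped the cache at 4 entries.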

The evictions total-cost adjustment also seems to assume we are evicting
all of the tuples, as opposed to tuples / est_entries of them.
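
Again with the made-up numbers above: if entries really only hold their
matched subsets, then evicting a single entry should cost on the order of
tuples / est_entries = 10,000 / 200 = 50 tuples' worth of work, not the
full 10,000.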

This is a "rescan", so, aside from cache management, isn't the cost of
originally populating the cache already accounted for elsewhere?

David J.
