Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes
Date: 2022-11-18 00:27:45
Message-ID: 2f30ba9c-b707-55ae-0afd-68eb6a8a7708@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/14/22 10:19, David Geier wrote:
> Hi Tom,
>> There won't *be* any MCV stats for a column that ANALYZE perceives to
>> be unique, so I'm not quite sure where the claimed savings comes from.
>
> We save if one join attribute is unique while the other isn't. In that
> case stored MCV stats are read for the non-unique attribute but then
> never used. This is because MCV stats in join selectivity estimation are
> only used if they're present on both columns
>

Right - if we only have MCV on one side of the join, we currently end up
loading the MCV we have only to not use it anyway. The uniqueness is a
simple way to detect some of those cases. I'd bet the savings can be
quite significant for small joins and/or cases with large MCV.

I wonder if we might be yet a bit smarter, though.

For example, assume the first attribute is not defined as "unique" but
we still don't have a MCV (it may be unique - or close to unique - in
practice, or maybe it's just uniform distribution). We end up with

have_mcvs1 = false

Can't we just skip trying to load the second MCV? So we could do

if (have_mcvs1 && HeapTupleIsValid(vardata2.statsTuple))
{ ... try loading mcv2 ... }

Or perhaps what if we have a function that quickly determines if the
attribute has MCV, without loading it? I'd bet the expensive part of
get_attstatslot() is the deconstruct_array().

We could have a function that only does the first small loop over slots,
and returns true/false if we have a slot of the requested stakind. It
might even check the isunique flag first, to make it more convenient.

And only if both sides return "true" we'd load the MCV, deconstruct the
array and all that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2022-11-18 00:35:44 Re: Perform streaming logical transactions by background workers and parallel apply
Previous Message Jeff Davis 2022-11-18 00:24:24 Re: allowing for control over SET ROLE