Quick Links

Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Quan Zongliang <quanzongliang(at)yeah(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Date:	2023-06-17 13:45:07
Message-ID:	753270d7-a303-d1bb-1291-da717bc615a1@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 6/17/23 02:02, Quan Zongliang wrote:
>
>
> On 2023/6/17 06:46, Tom Lane wrote:
>> Quan Zongliang <quanzongliang(at)yeah(dot)net> writes:
>>> Perhaps we should discard this (dups cnt > 1) restriction?
>>
>> That's not going to happen on the basis of one test case that you
>> haven't even shown us. The implications of doing it are very unclear.
>> In particular, I seem to recall that there are bits of logic that
>> depend on the assumption that MCV entries always represent more than
>> one row. The nmultiple calculation Tomas referred to may be failing
>> because of that, but I'm worried about there being other places.
>>

I don't recall any logic that'd outright fail with MCVs containing
single-row groups, and I haven't noticed anything obvious in analyze.c
during a cursory search. Maybe the paper analyze_mcv_list builds on
makes some assumptions? Not sure.

However, compute_distinct_stats() doesn't seem to have such protection
against single-row MCV groups, so if that's wrong we kinda already have
the issue I think (admittedly, compute_distinct_stats is much less used
than compute_scalar_stats).

>
> The statistics for the other table look like this:
> stadistinct | 6
> stanumbers1 | {0.50096667,0.49736667,0.0012}
> stavalues1 | {v22,v23,v5}
>
> The value that appears twice in the small table (v1 and v2) does not
> appear here. The stadistinct's true value is 18 instead of 6 (three
> values in the small table do not appear here).
>
> When calculating the selectivity:
> if (nd2 > sslot2->nvalues)
> totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);
>
> totalsel1 = 0
> nd2 = 21
> sslot2->nvalues = 2
> unmatchfreq1 = 0.99990002016420476
> otherfreq2 = 0.82608695328235626
>
> result: totalsel1 = 0.043473913749706022
> rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800
>

Attached is a script reproducing this.

I think the fundamental issue here is that the most common element of
the large table - v22 (~50%) is not in the tiny one at all. IIRC the
join estimation assumes the domain of one table is a subset of the
other. The values 22 / 23 violate that assumption, unfortunately.

Including all values into the small MCV fix this because then

otherfreq1 = 0.0

and that simply eliminates the impact of stuff that didn't have a match
between the two MCV lists. Which mitigates the violated assumption.

But once the small table gets too large for the MCV, this won't work
that well - it probably helps a bit, as it makes otherfreq1 smaller.

Which doesn't mean it's useless, but it's likely a rare combination that
a table is (and remains) smaller than MCV, and the large table contains
values without a match in the smaller one (think foreign keys).

>
>> Basically, you're proposing a rather fundamental change in the rules
>> by which Postgres has gathered statistics for decades. You need to
>> bring some pretty substantial evidence to support that. The burden
>> of proof is on you, not on the status quo.
>>

Right. It's a good example of a "quick hack" fixing one particular case,
without considering the consequences on other cases too much. Good as a
starting point, but plenty of legwork to do.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
script.sql	application/sql	564 bytes

In response to

Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics at 2023-06-17 00:02:54 from Quan Zongliang

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andrew Dunstan	2023-06-17 14:08:32	Re: run pgindent on a regular basis / scripted manner
Previous Message	jian he	2023-06-17 13:34:24	Re: Deleting prepared statements from libpq.