Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: Tender Wang <tndrwang(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joel Jacobson <joel(at)compiler(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq
Date: 2026-02-27 23:57:22
Message-ID: CAHewXNknzZ0+trb0RzvTBpzr9R+42W0xLYzDDsueJ11N1mbqNg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> 于2026年2月28日周六 03:15写道:
>
> Tender Wang <tndrwang(at)gmail(dot)com> writes:
> > I added Tom to the cc list. He may know more about this.
>
> Hmm, git blame says I originated this function 25 years ago
> (f905d65ee), but I don't claim to remember that.
>
> Looking at it now, though, I think that bd3e3e9e5 is indeed
> wrong but not in the way Joel suggests. The longstanding
> way to compute mcv_freq is
>
> /*
> * The first MCV stat is for the most common value.
> */
> if (sslot.nnumbers > 0)
> *mcv_freq = sslot.numbers[0];
>
> *This number is a fraction measured on the raw relation.*
> (Necessarily so, because it's just a number computed by ANALYZE.)
> Then bd3e3e9e5 added
>
> /*
> * If there are no recorded MCVs, but we do have a histogram, then
> * assume that ANALYZE determined that the column is unique.
> */
> if (vardata.rel && vardata.rel->rows > 0)
> *mcv_freq = 1.0 / vardata.rel->rows;
>
> This is a pure thinko. rel->rows is the estimated number
> of filtered rows. What I should have used is rel->tuples,
> which is the estimated raw relation size, so as to get a
> number that is commensurate with the longstanding way
> of calculating mcv_freq. Then that also matches up with
> computing avgfreq on the raw relation.
>
> So I think the correct fix is basically
>
> - if (vardata.rel && vardata.rel->rows > 0)
> - *mcv_freq = 1.0 / vardata.rel->rows;
> + if (vardata.rel && vardata.rel->tuples > 0)
> + *mcv_freq = 1.0 / vardata.rel->tuples;
>

Yeah, in my last email, I said I tried this way. But I worried that
rel->tuples may be zero for an empty relation.

> and I wonder if that will wind up in reverting a lot of the plan
> choice changes seen in bd3e3e9e5.

Yes, a lot plan diff in partition_join.sql.

--
Thanks,
Tender Wang

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2026-02-28 00:12:03 Re: doc: Clarify that empty COMMENT string removes the comment
Previous Message Melanie Plageman 2026-02-27 23:56:48 Re: All-visible pages with valid prune xid are confusing