| From: | Tender Wang <tndrwang(at)gmail(dot)com> |
|---|---|
| To: | Joel Jacobson <joel(at)compiler(dot)org> |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Subject: | Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq |
| Date: | 2026-02-26 13:56:18 |
| Message-ID: | CAHewXN=ZaQnATM8YDfN4LHzycr2no=khGNyvLc3krray+=rc6w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Joel,
Joel Jacobson <joel(at)compiler(dot)org> 于2026年2月26日周四 04:32写道:
>
> On Wed, Feb 25, 2026, at 13:19, Joel Jacobson wrote:
> > On Tue, Feb 24, 2026, at 19:21, Joel Jacobson wrote:
> >> This bug seems to sometimes cause the wrong table, the larger table, to
> >> be hashed in a Hash Join, and the smaller table to be used for probing.
>
> To help reviewers, instead of relying on benchmark results,
> I realized it's much better if we can actually prove the calculation
> of skew_ratio is incorrect, and that it becomes correct with the fix.
>
> I therefore added debugging here:
>
> if (avgfreq > 0.0 && *mcv_freq > avgfreq)
> + {
> + elog(DEBUG1, "mcv_freq=%g, avgfreq=%g, skew_ratio=%g",
> + *mcv_freq, avgfreq, *mcv_freq / avgfreq);
> estfract *= *mcv_freq / avgfreq;
> + }
>
> And created this minimal schema to prove the incorrect calculation:
>
> CREATE TABLE orders (
> order_id bigint NOT NULL,
> tracking_code bigint,
> region int NOT NULL
> );
>
> CREATE TABLE shipments (
> shipment_id bigint NOT NULL,
> tracking_code bigint NOT NULL
> );
>
> -- Skewed tracking_code distribution:
> -- 10% of rows share a single "hot" tracking_code (value 1),
> -- the rest get unique codes.
> -- This ensures mcv_freq > avgfreq, triggering the skew adjustment
> -- even in the fixed version of estimate_hash_bucket_stats.
> INSERT INTO orders
> SELECT
> g,
> CASE
> WHEN g > 150000 THEN NULL
> WHEN g <= 15000 THEN 1
> ELSE g
> END,
> (g % 1000) + 1
> FROM generate_series(1, 200000) g;
>
> INSERT INTO shipments
> SELECT g, CASE WHEN g <= 15000 THEN 1 ELSE g END
> FROM generate_series(1, 150000) g;
>
> CREATE INDEX ON orders (region);
>
> ANALYZE orders;
> ANALYZE shipments;
>
> -- Ground truth of the skew_ratio for orders WHERE region = 42:
> WITH base AS (
> SELECT
> count(*)::numeric AS total,
> count(tracking_code)::numeric AS nonnull,
> count(DISTINCT tracking_code) AS ndistinct
> FROM orders
> WHERE region = 42
> ),
> mcv AS (
> SELECT count(*)::numeric AS mcv_count
> FROM orders
> WHERE region = 42 AND tracking_code IS NOT NULL
> GROUP BY tracking_code
> ORDER BY count(*) DESC
> LIMIT 1
> )
> SELECT
> round(mcv_count / total, 6) AS mcv_freq,
> round(nonnull / total / ndistinct, 6) AS avgfreq,
> round((mcv_count / total) / (nonnull / total / ndistinct), 6) AS skew_ratio
> FROM base, mcv;
>
> mcv_freq | avgfreq | skew_ratio
> ----------+----------+------------
> 0.075000 | 0.005515 | 13.600000
> (1 row)
>
> SET client_min_messages = DEBUG1;
>
> SELECT *
> FROM orders o
> JOIN shipments s ON s.tracking_code = o.tracking_code
> WHERE o.region = 42;
>
> -- HEAD (65707ed):
> DEBUG: mcv_freq=0.0748333, avgfreq=8.69725e-06, skew_ratio=8604.25
>
> -- 0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch:
> DEBUG: mcv_freq=0.0738667, avgfreq=0.00871124, skew_ratio=8.47947
>
> In HEAD, the skew_ratio for orders.tracking_code is wrongly estimated
> to 8604.25, when the ground truth is 13.6, which the fixed estimate
> 8.47947, is a good approximation of.
>
> The fix only moves the computation of avgfreq from before the
> ndistinct adjustment, to after it:
>
> diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
> index 29fec65559..d19c4b5d96 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -4456,9 +4456,6 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
> else
> stanullfrac = 0.0;
>
> - /* Compute avg freq of all distinct data values in raw relation */
> - avgfreq = (1.0 - stanullfrac) / ndistinct;
> -
> /*
> * Adjust ndistinct to account for restriction clauses. Observe we are
> * assuming that the data distribution is affected uniformly by the
> @@ -4473,6 +4470,9 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
> ndistinct = clamp_row_est(ndistinct);
> }
>
> + /* Compute avg freq of all distinct data values in the filtered relation */
> + avgfreq = (1.0 - stanullfrac) / ndistinct;
> +
> /*
> * Initial estimate of bucketsize fraction is 1/nbuckets as long as the
> * number of buckets is less than the expected number of distinct values;
>
> /Joel
>
>
I think your analysis is correct.
After bd3e3e9, the mcv_freq is calculated by RelOptInfo.rows, which
accounts for restriction clauses.
But avgfreq is for the raw relation.
I tried another way. I used vardata.rel->tuples replacing with
original vardata.rel->rows.
I got the correct plan, and partition_join.sql had a lot plan diff as
your first thread said.
But vardata.rel->tuples may be zero due to an empty relation.
So I agree with your fix.
I added Tom to the cc list. He may know more about this.
--
Thanks,
Tender Wang
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Andrew Dunstan | 2026-02-26 14:02:48 | Re: Non-text mode for pg_dumpall |
| Previous Message | Robert Haas | 2026-02-26 13:55:56 | Re: pg_plan_advice |