Re: Weirdly pesimistic estimates in optimizer

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Kubečka <kubecka(dot)dav(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Weirdly pesimistic estimates in optimizer
Date: 2015-03-06 16:58:42
Message-ID: 21909.1425661122@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> I chewed on this for awhile and decided that there'd be no real harm in
> taking identification of the unique expressions out of
> create_unique_path() and doing it earlier, in initsplan.c; we'd need a
> couple more fields in SpecialJoinInfo but that doesn't seem like a
> problem. However, rel->rows is a *big* problem; we simply have not made
> any join size estimates yet, and can't, because these things are done
> bottom up.

> However ... estimate_num_groups's dependency on its rowcount input is not
> large (it's basically using it as a clamp). So conceivably we could have
> get_loop_count just multiply together the sizes of the base relations
> included in the semijoin's RHS to get a preliminary estimate of that
> number. This would be the right thing anyway for a single relation in the
> RHS, which is the most common case. It would usually be an overestimate
> for join RHS, but we could hope that the output of estimate_num_groups
> wouldn't be affected too badly.

Attached is a draft patch that does those two things. I like the first
part (replacing SpecialJoinInfo's rather ad-hoc join_quals field with
something more explicitly attuned to semijoin uniqueness processing).
The second part is still pretty much of a kluge, but then get_loop_count
was a kluge already. This arguably makes it better.

Now, on the test case you presented, this has the unfortunate effect that
it now reliably chooses the "wrong" plan for both cases :-(. But I think
that's a reflection of poor cost parameters (ie, test case fits handily in
RAM but we've not set the cost parameters to reflect that). We do get the
same rowcount and roughly-same cost estimates for both the random_fk_dupl
and random_fk_uniq queries, so from that standpoint it's doing the right
thing. If I reduce random_page_cost to 2 or so, it makes the choices you
wanted.

regards, tom lane

Attachment Content-Type Size
adjust-loop-count-for-semijoins.patch text/x-diff 39.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2015-03-06 17:19:36 Re: pg_upgrade and rsync
Previous Message Tom Lane 2015-03-06 16:48:40 Re: Clamping reulst row number of joins.