Re: Weirdly pesimistic estimates in optimizer

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Kubečka <kubecka(dot)dav(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Weirdly pesimistic estimates in optimizer
Date: 2015-03-06 05:16:19
Message-ID: 32059.1425618979@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

=?UTF-8?Q?David_Kube=C4=8Dka?= <kubecka(dot)dav(at)gmail(dot)com> writes:
> There is division by loop_count because of predicted effect of caching and
> it is exactly this division which makes the run_cost for single index
> lookup so low compared with the query version with random_fk_uniq. So the
> main problem is that the planner calls cost_index with loop_count equal to
> number of rows of inner table, *but it ignores semi-join semantics*, i.e.
> it doesn't account for the number of *unique* rows in the inner table which
> will actually be the number of loops.

Good point.

> Since this is my first time looking into the optimizer (and in fact any
> postgres) code I am not yet able to locate the exact place where this
> should be repaired, but I hope that in few days I will have a patch :-)

Hm, this might not be the best problem to tackle for your first Postgres
patch :-(. The information about the estimated number of unique rows
isn't readily available at the point where we're creating parameterized
paths. When we do compute such an estimate, it's done like this:

pathnode->path.rows = estimate_num_groups(root, uniq_exprs, rel->rows);

where uniq_exprs is a list of right-hand-side expressions we've determined
belong to the semijoin, and rel->rows represents the raw size of the
semijoin's RHS relation (which might itself be a join). Neither of those
things are available to create_index_path.

I chewed on this for awhile and decided that there'd be no real harm in
taking identification of the unique expressions out of
create_unique_path() and doing it earlier, in initsplan.c; we'd need a
couple more fields in SpecialJoinInfo but that doesn't seem like a
problem. However, rel->rows is a *big* problem; we simply have not made
any join size estimates yet, and can't, because these things are done
bottom up.

However ... estimate_num_groups's dependency on its rowcount input is not
large (it's basically using it as a clamp). So conceivably we could have
get_loop_count just multiply together the sizes of the base relations
included in the semijoin's RHS to get a preliminary estimate of that
number. This would be the right thing anyway for a single relation in the
RHS, which is the most common case. It would usually be an overestimate
for join RHS, but we could hope that the output of estimate_num_groups
wouldn't be affected too badly.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2015-03-06 06:01:14 Re: Combining Aggregates
Previous Message Michael Paquier 2015-03-06 04:07:04 Re: Table-level log_autovacuum_min_duration