Re: nested loop semijoin estimates

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: nested loop semijoin estimates
Date: 2015-05-30 19:50:23
Message-ID: 30658.1433015423@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
> I wonder whether the
> run_cost += inner_run_cost;
> is actually correct, because this pretty much means we assume scanning
> the whole inner relation (once). Wouldn't something like this be more
> appropriate?
> run_cost += inner_run_cost * inner_scan_frac;

Well, not entirely. The reason why it's like that is explained (evidently
not well enough) in the comment just above:

* A complicating factor is that rescans may be cheaper than first
* scans. If we never scan all the way to the end of the inner rel,
* it might be (depending on the plan type) that we'd never pay the
* whole inner first-scan run cost. However it is difficult to
* estimate whether that will happen, so be conservative and always
* charge the whole first-scan cost once.

The case this is concerned about is something like a Materialize node
above a scan node. The Materialize node makes rescans much cheaper than
the original scan, but only once the data has been fetched to begin with.
So, while the first successful probe should indeed have cost something
like inner_run_cost * inner_scan_frac, once we make this change it's
no longer legitimate for final_cost_nestloop to suppose that the cost
of an unsuccessful probe is just inner_rescan_run_cost. Any unsuccessful
probe will mean we must incur all of inner_run_cost in order to complete
the underlying scan and fully load the Materialize node.

However, in the case where has_indexed_join_quals() is true, I think
your change is correct, because then all the scans will either stop
on the first tuple or quickly determine that no tuples satisfy the
indexquals. (Also, in the cases where has_indexed_join_quals() can
succeed, rescan cost is the same as initial scan cost anyhow.)

So what this seems to mean is that for SEMI/ANTI join cases, we have to
postpone all of the inner scan cost determination to final_cost_nestloop,
so that we can do this differently depending on whether
has_indexed_join_quals() is true. That's a little bit annoying because it
will mean we take the shortcut exit less often; but since SEMI/ANTI joins
aren't that common, it's probably not going to be a big planning time hit.

Not sure yet about your other point about the indexscan getting rejected
too soon. That doesn't seem to be happening for me, at least not in HEAD.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-05-30 20:47:27 Re: [CORE] postpone next week's release
Previous Message Peter Geoghegan 2015-05-30 19:46:31 Re: [CORE] postpone next week's release