From: | "Markus Stocker" <markus(at)wilabs(dot)ch> |
---|---|
To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-novice(at)postgresql(dot)org |
Subject: | Re: Explain explained |
Date: | 2008-03-04 20:52:30 |
Message-ID: | a9dfaf710803041252u47a623f5yce1ab9af5a39e4cb@mail.gmail.com |
Lists: | pgsql-novice |
Tom,
Thanks for your input, very enlightening!
On Tue, Mar 4, 2008 at 1:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Markus Stocker" <markus(at)wilabs(dot)ch> writes:
> > 1/ How does postgresql know that the constrain individual_name.name =
> > 'http://www.University0.edu' matches one row (which is in fact true)?
>
> It doesn't *know* that, except in the special case where it sees there's
> a unique index on that column.
Indeed, there is a unique index on that column, sorry about that. In
fact, pg_stats says that there are no MCVs for that column
(obviously). So that explains a lot. I get
sel(individual_name.name = 'http://www.University0.edu')
= (1 - null_frac) / num_distinct
= (1 - 0) / 17174
rows
= rel_cardinality * sel
= 17174 * 1 / 17174
= 1
which explains why the expected number of rows is 1.
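The arithmetic above can be sketched in a few lines. This is only an illustration of the no-MCV fallback described in the planner-stats-details page, with the numbers (null_frac = 0, num_distinct = 17174, relation cardinality 17174) taken from this thread:

```python
# Sketch of the planner's selectivity arithmetic for a column with a
# unique index and no MCV list. Values are those quoted in the thread;
# this is not the actual planner code, just its arithmetic.
null_frac = 0.0
num_distinct = 17174
rel_cardinality = 17174

# With no MCVs, the equality-selectivity fallback is
# (1 - null_frac) / num_distinct.
selectivity = (1.0 - null_frac) / num_distinct

# Estimated rows = relation cardinality * selectivity.
rows = rel_cardinality * selectivity

print(round(rows))  # → 1
```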
> There's a new section in the 8.3 docs that gives some more details about the estimation process:
> http://www.postgresql.org/docs/8.3/static/planner-stats-details.html
Thanks for pointing to this, very well documented.
> > Further, why expects postgresql
> > for each operation a row size of 1?
>
> It's not expecting more than one join partner at each step. Again
> that's just statistical.
Sounds reasonable. The first constraint (i.e. on individual_name.name)
returns an estimated 1 row (see above), hence successive joins can't
return more than one.
> > 2/ Sequential scans seem to me more expensive compared to index scans.
> > I'm wondering why the sequential scan on individual_name is the first
> > executed in the plan.
>
> I was wondering that too; it looks like it should be a candidate for an
> index search. Datatype problem maybe? Again, you've not shown us the
> table definitions...
I guess this is now at least partially explained too. If I drop the
index on individual_name.name, pg_stats still shows no MCVs for the
column, so the expected number of rows doesn't change (even after
ANALYZE, since the values are simply unique). I'm not sure how the
sequential scan influences the overall cost compared to an index
scan, though.
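For intuition only, here is a rough sketch of how the default cost parameters weigh a sequential scan against a single-row index scan. The page and tuple counts below are made-up illustrative values, not numbers from this thread, and this is a simplification of the real cost model:

```python
# PostgreSQL's default planner cost constants (8.x defaults).
seq_page_cost = 1.0      # sequential page fetch
random_page_cost = 4.0   # random page fetch (index access pattern)
cpu_tuple_cost = 0.01    # per-tuple processing

# Hypothetical relation size, for illustration only.
pages, tuples = 100, 17174

# A sequential scan reads every page and processes every tuple.
seq_scan_cost = seq_page_cost * pages + cpu_tuple_cost * tuples

# An index scan fetching one row pays random I/O for the few pages
# it touches (hypothetical: index descent plus one heap fetch).
index_pages_touched = 4
index_scan_cost = random_page_cost * index_pages_touched

print(seq_scan_cost, index_scan_cost)
```

With an expected result of one row, the index scan wins by a wide margin here, which is why a seq scan on individual_name looks surprising.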
The schema for the individual_name relation is
Column | Type | Modifiers
--------+---------+-----------
id | integer | not null
name | text | not null
Indexes:
"individual_name_pkey" PRIMARY KEY, btree (id)
"individual_name_name_idx" UNIQUE, btree (name)
whereas the schema for the concept_assertion relation is
Column | Type | Modifiers
------------+---------+-----------
concept | integer | not null
individual | integer | not null
Indexes:
"concept_assertion_pkey" PRIMARY KEY, btree (concept, individual)
"concept_assertion_concept_idx" btree (concept)
Foreign-key constraints:
"concept_assertion_concept_fkey" FOREIGN KEY (concept) REFERENCES
tbox_name(id)
"concept_assertion_individual_fkey" FOREIGN KEY (individual)
REFERENCES individual_name(id)
> > 3/ There is one condition in the query, i.e. concept_assertion.concept
> > = 5 with an empty result set, i.e. selectivity 0. In fact, the last
> > index scan on concept_assertion ca_1 in the plan is never executed
> > (this is what 'explain analyze' tells me). I'm wondering, why this
> > constrain is not executed first. By executing this first, we could
> > just never execute everything else.
>
> Postgres never uses a rowcount estimate of less than one row,
> so it's not going to try to optimize that way. There are various
> rationales behind that choice, but the main one is that we don't
> trust the statistics unreservedly. The odds that an estimate of zero
> is more accurate than an estimate of one just aren't good enough,
> and the likelihood of generating a really awful plan if we did believe
> zero rows is too high.
Agreed, that makes a lot of sense. If I try to calculate the
estimated rows for concept_assertion.concept = 5 (i.e. the constraint
which effectively returns an empty result set), I get
sel(concept_assertion.concept = 5)
= (1 - sum(mvf)) / (num_distinct - num_mcv)
= 0.021545454
rows
= rel_cardinality * sel
= 18128 * 0.021545454
= 391 (rounded)
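The same arithmetic as a quick sketch; sum(mvf) and num_mcv aren't quoted above, so the resulting selectivity value is used directly:

```python
# Sketch of the estimated-rows calculation for concept = 5, using the
# numbers from the thread. The selectivity comes from the non-MCV
# branch, (1 - sum(mvf)) / (num_distinct - num_mcv); its inputs aren't
# given here, so we start from the quoted result.
selectivity = 0.021545454
rel_cardinality = 18128

rows = rel_cardinality * selectivity
print(round(rows))  # → 391
```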
Clearly, at least when inspecting the two constraints individually,
i.e. individual_name.name = 'http://www.University0.edu' vs.
concept_assertion.concept = 5, the optimizer chooses the first as it
is expected to return fewer rows, i.e. 1 vs. 391.
Hope the explanations sound somewhat reasonable.
Thanks again,
markus