Re: [PATCH] Equivalence Class Filters

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Equivalence Class Filters
Date: 2015-12-09 02:14:54
Message-ID: CA+TgmoaV7ZgaxJUQ6U5afjRv1QGi67fCGmJ5sJ4B3dDQG+1BuA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 7, 2015 at 8:26 PM, David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> The biggest frustration for me is that the way Tom always seems to argue
> his point, it's as if planning time is roughly the same as or more expensive
> than execution time, and likely that's true in many cases, but I would
> imagine that in more normal cases execution time is longer, although I've
> never had the guts to stand up and argue this as I don't have any stats to
> back me up.

I think this is missing the point. It is possible to have a query
planner optimization that is so expensive that it loses even when it
improves the plan, but I don't think this optimization falls into that
category. This particular change would not have been requested as
many times as it has been if people didn't keep rediscovering that, on
a certain class of queries, it works *really* well. The problem that
I understand Tom to be concerned about is the queries where the
optimization consumes additional planning time without delivering a
better plan - and those where, without careful thought, it might even
deliver a worse plan.

Now, I'm not sure Tom is right about that, but he might be. The class
of queries we're talking about here are the ones where somebody
constrains a column that is part of an equivalence class with multiple
members. Perhaps the best example is a large join, let's say 10
tables, where the join column is the same for all tables; thus, we
have an equivalence class with 10 members. Suppose further we have an
inequality condition applied to one member of that equivalence class.
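
To make the shape of the query concrete, here's a sketch (the table
and column names are invented for illustration):

    SELECT count(*)
      FROM t1
      JOIN t2 ON t2.id = t1.id
      JOIN t3 ON t3.id = t1.id
      -- ... and so on through t10, all joined on the same column
     WHERE t1.id > 1000000;

The equivalence class here is {t1.id, t2.id, ..., t10.id}, and the
inequality constrains just one member of it.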

Currently, we'll apply that inequality to the table that the user
actually mentioned and ignore all the others; in theory, it could be
right to enforce that inequality condition on any nonempty subset of
the 10 tables we've got. It might be right to pick exactly one of
those tables and enforce the inequality there, or it might be right to
enforce it on some of them, or it might be right to enforce it on all
of them. The reason the answer isn't automatically "all of them" is
that, first of all, the cost of enforcing the condition at a
particular table may exceed what that filtering saves in execution
time at higher levels of the plan tree. For example,
consider A JOIN B ON A.X = B.X WHERE A.X > 1000000. It might be that
the range of A.X is [0,1000001] but the range of B.X is
[1000000,2000000]; so enforcing the inequality against A is very
selective but enforcing it against B filters out basically nothing.
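
Spelled out as SQL (the data ranges are of course invented for
illustration):

    -- suppose A.X spans [0, 1000001] and B.X spans [1000000, 2000000]
    SELECT * FROM A JOIN B ON A.X = B.X WHERE A.X > 1000000;
    -- A.X > 1000000 rejects nearly every row of A: cheap and selective.
    -- The derived qual B.X > 1000000 accepts nearly every row of B: we
    -- would pay the comparison on each B row and filter almost nothing.
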
Furthermore, there are some cases involving parameterized paths where
enforcing the inequality multiple times is definitely bad: for
example, if we've got a nested loop where the outer side is a seq scan
that enforces the condition and the inner side is an index probe, it
is just a waste to retest it on the inner side. We already know that
the outer row passes the inequality, so the inner row will necessarily
pass also. This doesn't apply to merge or hash joins, and it also
doesn't apply to all nested loops: scans that aren't parameterized by
the equivalence-class column can still benefit from separate
enforcement of the inequality.
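
For instance, a plan shape like this (hand-written in the style of
EXPLAIN output, not taken from a real run):

    Nested Loop
      ->  Seq Scan on a
            Filter: (a.x > 1000000)
      ->  Index Scan using b_x_idx on b
            Index Cond: (b.x = a.x)

Every outer row has already passed a.x > 1000000, and the inner scan
is parameterized by a.x, so adding a b.x > 1000000 filter to the
index scan would test something we already know to be true.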

Considering the factors mentioned in the previous paragraph, it seems
likely to me that a naive patch that propagates the inequalities to
every relation where they could hypothetically be enforced will be a
significant loss in some cases even just in execution cost, ignoring
completely the effects on planning time. And, on the flip side,
figuring out what non-empty subset of relations forms the optimal set
upon which to enforce the inequality seems like a complicated problem.
A first cut might be to enforce the inequality against the relation
where it's believed to be most selective, and to also enforce it in
paths for other relations that aren't parameterized by the
equivalence-class column mentioned in the inequality, provided that the
selectivity is thought to be above some threshold ... but I'm not sure
this is very principled, and it's certainly not obvious what arbitrary
threshold would be appropriate. The threshold where multiple
enforcement of the qual starts to pay off also depends on the cost of
the qual: an expensive qual has to filter out more rows than a cheap
qual to be worthwhile. I'm not sure how to estimate all this, but I
don't have trouble believing that if not done carefully it could
either (a) cost a lot of planning time or (b) generate lousy plans.
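
A back-of-the-envelope way to see that trade-off (the symbols and the
model here are mine, purely illustrative): say the derived qual costs
c per evaluation, removes a fraction f of the n rows at some scan, and
every row that escapes into the plan above costs a further s to
process. Enforcing the qual at that scan wins roughly when

    n * c  <  n * f * s,    i.e.    f  >  c / s

so an expensive qual (large c) must remove a larger fraction of the
rows before it pays for itself.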

Now, all that having been said, I think this is a pretty important
optimization. Lots of people have asked for it, and I think it would
be worth expending some brainpower to try to figure out a way to be
smarter than we are now, which is, in a nutshell, as dumb as possible.
You're completely right that, on an OLAP query that's going to run for
a minute, a few extra milliseconds of planning time could be totally
OK even if they don't yield any benefit on most queries. Maybe we can
get the cost of this feature down to the point where we can turn it on
for everybody and have that be OK. But if not, having something we
can turn on for queries that are going to run for a long time, or that
we estimate are going to run for a long time, would, I think, be
useful. As things stand, planning time can consume a very significant
percentage of total runtime on OLTP queries, and if you are processing
300k queries/second and 10% of that is planning time, and somebody
boosts planning time by 1%, that's a 0.1% hit overall and you just
lost several hundred queries per second. That's not nothing,
especially if we do 3 or 4 such "optimizations" per release cycle.
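
(For the record, the arithmetic behind that figure: planning is 10% of
the total work, so a 1% planning slowdown is 0.10 * 0.01 = 0.001 of
the total, i.e. 0.1%; and 0.1% of 300,000 queries/second is about 300
queries/second.)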
But if the optimization can be made nearly free in cases where it
doesn't apply, that's a different story, and of course OLAP is a whole
different ball game.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
