Re: Adaptive query optimization

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adaptive query optimization
Date: 2019-06-11 21:43:31
Message-ID: 20190611214331.eww5yot6lytomjcn@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Alexander,

Thanks for starting this thread. I've had similar ideas in the past and
even hacked together something (very dirty), so it's great someone else
is interested in this topic too.

On Mon, Jun 10, 2019 at 11:53:02AM +0300, Konstantin Knizhnik wrote:
>Hi,
>
>Inefficiency of Postgres on some complex queries in most cases is
>caused by errors in selectivity estimations.
>Postgres doesn't take in account correlation between columns unless
>you explicitly create mutlicolumn statistics
>(but multicolumn statistic is used only for restriction clauses, not
>for join selectivity, where estimation errors are most critical).
>

Yes, that's the current status. I'd say we want to allow using
multicolumn stats to join estimation too, but that's a hard problem.

>Certainly it is possible to collect more statistics and improve
>estimation formulas but c'est la vie is that estimation of relation
>size after several joins //more looks like an exercise in guesswork.

I'd go even further - it's a simple fact we can't have perfect stats
that would give us "sufficiently good" estimates for all common data
distributions and clauses.

Firstly, stats are a merely a simplified representation of the overall
data distribution - which makes them small, but eliminates some details
(which may be quite important for highly non-uniform distributions).

Secondly, we only have a number of generic stats types (MCV, histogram,
...) but that may not be sufficient to "capture" the imporant aspects of
the data distribution.

And finally, we only know how to use those stats for specific types of
clauses (equality, inequality, ...) with very simple expressions. But
that's often not what the users do.

I think adaptive query optimization - in the sense of collecting data
from query executions and and leverating that when planning future
queries - can (hopefully) help with all those challenges. At least in
some cases.

>This is why alternative approach based on adaptive query optimization
>seems to be more promising. When we analyze query execution with
>EXPLAIN ANALYZE, we can see actual number of rows for each plan node.
>We can use this information to adjust clause selectivity and reduce
>estimation error.
>

Yep, that's roughly the idea. I don't think we need EXPLAIN ANALYZE, it
should be enough to instrument queries to collect row counts on the fly.
But I guess that's mostly what the explain_analyze changes do.

>At PGconf 2017 my former colleague Oleg Ivanov made presentation about
>using machine learning for AQO:
>https://www.pgcon.org/2017/schedule/events/1086.en.html
>Right now this project is available from PostgresPro repository:
>https://github.com/postgrespro/aqo
>
>There are several problems with this approach:
>1. It requires "learning phase"

I don't think "learning phase" is an issue, in fact I think that's
something we need to do - it ensures we have enough data to make good
decisions.

>2. It saves collected data in Postgres tables, which makes read-only
>transaction executing only queries to become read-write transaction,
>obtaining XID...

Yeah, that's an issue because it makes it useless on standbys etc. I
think it'd be enough to do something similar to pg_stat_statements, i.e.
store it in memory and flush it to disk once in a while.

>3. It doesn't take in account concrete values of literals used in
>clauses, so it is not able to address data skews.

Yep. I don't think it's necessarily an issue with all approaches to
adaptive optimization, though. But I agree we should detect both
systemic estimation issues, and misestimates for particular parameter
values. I think that's doable.

>4. Machine learning  can be quite  expensive and seems to be overkill
>if we want just to adjust selectivities according to actual number of
>affected rows.
>

I think that depends - some machine learning approaches are not that
bad. But I think there's a more serious issue - explainability. We need
a solution where we can explain/justify why it makes some decisions. I
really don't want a black box that produces numbers that you just need
to take at face value.

The good thing is that the simpler the method, the less expensive and
more explainable it is.

>I tried to create much simpler version of AQO based on auto_explain
>extension.
>This extension provide all necessary infrastructure to analyze
>statements with long execution time.
>I have added two new modes to auto_explain:
>1. Auto generation of multicolumn statistics for variables using in
>clauses with large estimation error.

Interesting! I probably wouldn't consider this part of adaptive query
optimization, but it probably makes sense to make it part of this. I
wonder if we might improve this to also suggest "missing" indexes?

>2. Direct adjustment of estimated number of rows based on information
>collected by EXPLAIN ANALYZE.
>

Yep!

>As well as in Oleg's implementation, it requires few changes  in
>Postgres core: introducing some new hooks for relation size
>estimation.
>But most of functionality is implemented in auto_explain extension.
>Attached please find patch to vanilla.
>Please read Readme.ms file for more details.
>
>I have tested it on join order benchmark JOB
>https://github.com/gregrahn/join-order-benchmark
>aqo.svg contains results of applying my and Oleg's versions of AQO to
>JOB queries. First result corresponds to the vanilla Postgres, second
>- my AQO keeping literal values, third my AQO ignoring literal values
>and last one result of Oleg's machine learning (after 10 iterations).
>
>The principle problem with AQO approach is that using provided explain
>feedback we are able to adjust selectivities only for one particular
>plan.
>But there may be many other alternative plans, and once we adjust one
>plan, optimizer most likely choose some other plan which actually can
>be ever worser than
>original plan. Certainly if we continue learning, then sooner or later
>we will know real selectivities for all possible clauses. But number
>of possible plans can be very
>large for queries with many joins (factorial), so many iterations may
>be required. What is worser some intermediate bad plans can take huge
>amount of time.
>Particularly sixth iteration of Oleg's AQO on JOB queries set takes
>about two hours (instead of original 10 minutes!).
>Such thing doesn't happen with my AQO, but it seems to be just matter
>of luck.
>

Right. But I think I might have an idea how to address (some of) this.

As I already mentioned, I was experimenting with something similar,
maybe two or three years ago (I remember chatting about it with Teodor
at pgcon last week). I was facing the same issues, and my approach was
based on hooks too.

But my idea was to not to track stats for a plan as a whole, but instead
decompose it into individual nodes, categoried into three basic groups -
scans, joins and aggregations. And then use this extracted information
to other plans, with "matching" nodes.

For example, let's consider a simple "single-scan" query

SELECT * FROM t1 WHERE a = ? AND b = ? AND c < ?;

Now, if you execute this enought times (say, 100x or 1000x), tracking
the estimates and actual row counts, you may then compute the average
misestimate (maybe a geometric mean would be more appropriate here?):

AVG(actual/estimate)

and if this is significantly different from 1.0, then we can say there's
a systemic misestimate, and we can use this as a correction coefficient
when computing the scan estimate. (And we need to be careful about
collection new data, because the estimates will include this correction.
But that can be done by tracking "epoch" of the plan.)

Now, if someone uses this same scan in a join, like for example

SELECT * FROM t1 JOIN t2 ON (t1.id = t2.id)
WHERE (t1.a = ? AND t1.b = ? AND t1.c < ?)
AND (t2.x = ? AND t2.y = ?)

then we can still apply the same correction to the t1 scan (I think).
But then we can also collect data for the t1-t2 join, and compute a
correction coefficient in a similar way. It requires a bit of care
because we need to compensate for misestimates of inputs, but I think
that's doable.

Of course, all this is rather speculative, and I never got to anything
beyond a very simple PoC. So I hope it makes at least some sense.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2019-06-11 22:22:56 Re: tableam: abstracting relation sizing code
Previous Message Andres Freund 2019-06-11 21:07:29 Re: openssl valgrind failures on skink are due to openssl issue