Re: The science of optimization in practical terms?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: decibel <decibel(at)decibel(dot)org>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 05:23:37
Message-ID: 603c8f070902172123m3c424e9bxcf46ad4e4b380bf0@mail.gmail.com
Lists: pgsql-hackers

> Actually, a simple algorithm that might work really well would be to
> calculate relation cache odds as ( number of page accesses for relation /
> number of page accesses for all relations ) * ( sum(relpages)*BLKSZ /
> eff_cache_size ), where number of page accesses would be both from relcache
> and not.

I don't think that formula makes any sense. If effective_cache_size
is in the denominator, then increasing it will make the odds of
finding the page in cache go down.
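
With made-up numbers: say a relation accounts for half of all page
accesses, its pages total 1GB, and effective_cache_size is 4GB.
Plugging those into the formula as written:

-- made-up figures: 50% of all page accesses, sum(relpages)*BLKSZ = 1GB
SELECT 0.5 * (1.0 / 4.0) AS cache_odds;   -- 0.125 with effective_cache_size = 4GB
SELECT 0.5 * (1.0 / 8.0) AS cache_odds;   -- 0.0625 after raising it to 8GB

Doubling the cache halves the computed odds, which is exactly backwards.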

> One thing this doesn't address though is the report from a few
> months ago that accessing small tables is still faster with an index scan,
> even if we know the whole thing is in cache (I don't remember if that was
> ever resolved...)

I'm not sure if this is what you're referring to, but there was a
relatively recent post on, I believe, -performance, where a bitmap
index scan that hit almost the entire table beat out a seqscan. I
don't think there was any further discussion and I'm still mystified
as to how it's possible.
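
If anyone wants to poke at that case again, something along these lines
(table and column names are just placeholders) makes the comparison easy
by forcing each plan in turn:

-- placeholder table/column; run both against the same data
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM some_table WHERE some_col > 0;   -- bitmap/index plan
RESET enable_seqscan;
SET enable_bitmapscan = off;
SET enable_indexscan = off;
EXPLAIN ANALYZE SELECT * FROM some_table WHERE some_col > 0;   -- forced seqscan
RESET ALL;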

> Another idea would be to look at an efficient way to measure how long it
> actually takes to pull data from the OS. This has been suggested in the
> past, but the idea there was to measure every block access, and the concern
> was the overhead of the timing calls. But what if we sampled instead? Or,
> what if we read multiple blocks at one time in the cases where we knew we'd
> need to (seqscan and an index scan needing more than one tuple). Another
> option would by an async IO process that is responsible for measuring this
> stuff; I believe some people have played with async IO and gotten good
> results.
>
> Of course, on dtrace platforms we could just plug into dtrace...
>
>> You might also run into
>> problems with relations that have "hot" segments that are accessed
>> frequently and stay cached, and "cold" segments that are never
>> touched: if 20% of the relation is in cache, but that's the only 20%
>> of the relation we ever access, then our hit rate will be 100% rather
>> than 20%.
>
> Yes, but that would be accurate :)

No, we'd predict the hit rate to be 20%, but the real hit rate would be 100%.

> In reality, I think we need to re-visit the idea of evaluating how close a
> chosen query plan is matching reality as we're running. If we thought we'd
> be seeing a 100% hit rate but in reality it's much lower we could re-plan
> (of course this probably only makes sense for queries that take many
> seconds).

I don't think it's going to be very practical to re-plan the query in
its entirety, because then you'd have to somehow undo all of the work
you'd done thus far (including side effects, if any), which might not
be possible and certainly isn't easy. What might be practical is to
bail out of a nested loop that turns out to iterate more times than
expected and hash the inner rel, or even sort the remaining portion of
the outer rel and the entire inner rel and then merge-join them. The
problem is that these sorts of changes can alter the order in which
results are generated, and if the parent node is something like a
merge-join that needs the results to be ordered in a particular way,
then you've got a problem. Furthermore, it's not easy to decide when
to switch strategies. If you decide to switch to a hash join just
prior to what would have been the last iteration of the nested loop,
you lose.

I'm interested to know whether anyone else shares my belief that
nested loops are the cause of most really bad plans. What usually
happens to me is that the planner develops some unwarranted optimism
about the number of rows likely to be generated by the outer side of
the join and decides that it's not worth sorting the inner side or
building a hash table or using an index, and that the right thing to
do is just rescan the inner node on every pass. When the outer side
returns three or four orders of magnitude more results than expected,
ka-pow!
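
The shape I keep running into looks roughly like this (schema entirely
hypothetical): an opaque expression on the outer side drives its row
estimate down to almost nothing, so the planner decides rescanning the
inner side is cheap:

-- hypothetical tables orders / order_lines
EXPLAIN ANALYZE
SELECT *
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
WHERE o.region = 'EU'
  AND (o.order_id % 10) = 0;   -- opaque expression, default selectivity applied

If the outer side actually produces tens of thousands of rows, every one
of them pays for another pass over the inner side.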

Another approach to this problem might be to try to make the planner a
little more cautious about choosing nested loops in the first place.
Given a column a with the values 1 .. 10^6, the planner estimates the
number of rows for a = X as 1, a in (X1..Xn) as n, a not in (X1..Xn)
as 10^6 - n, and a < X for all X < 100 as 100. These are all pretty
reasonable estimates (though one could hope to get a different result for
a < 5 than for a < 100). But as soon as you use some operation that the
planner knows nothing about, the guesses get really bad:

CREATE TABLE big (id serial, x text);
INSERT INTO big (x) SELECT random()::text FROM generate_series(1,1000000);
ANALYZE;
EXPLAIN SELECT * FROM big WHERE id % 2 = 0 AND (id + 0) % 2 = 0 AND
(id - 0) % 2 = 0;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on big  (cost=0.00..36375.00 rows=1 width=22)
   Filter: (((id % 2) = 0) AND (((id + 0) % 2) = 0) AND (((id - 0) % 2) = 0))

The fact that the selectivity of an unknown expression is arbitrarily
set to 0.005 is a defensible choice that doesn't happen to match my
experience, but the idea that with three unknown expressions the
selectivity is now (0.005)^3 = 0.000000125 seems hard to justify.
That's a sufficiently small selectivity that it will often result in
nestloop plans, and they tend not to be good when the real selectivity
turns out to be, say, 0.1.
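
In the example above the damage is easy to quantify: all three predicates
are the same condition in disguise, so the real result is 500,000 rows
against an estimate of 1 (on the table built above):

SELECT count(*) FROM big WHERE id % 2 = 0;   -- 500000: real selectivity is 0.5
SELECT 1000000 * power(0.005, 3);            -- 0.125, which gets clamped to rows=1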

>> But even a primitive algorithm would probably be a lot better than
>> what we have now. I'm guessing that there are a lot of databases where
>> either the whole database fits in cache, or a decent chunk of
>> relatively small core relations fit in cache and then there are some
>> big or infrequently-used ones that don't.

...Robert
