Linear vs. nonlinear planner cost estimates

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Linear vs. nonlinear planner cost estimates
Date: 2016-12-14 19:12:40
Message-ID: 31065.1481742760@sss.pgh.pa.us

I've been looking into the problem reported here:
https://www.postgresql.org/message-id/CAOfJSTwzQ7Fx6Yjeg9mFkMsM5OVKPoa=EgkHceGdkr1TWg8vkA@mail.gmail.com

I can reproduce similar misbehavior with this test case:

create table updates as
  select (-log(random()) * 10)::int as driver_id,
         now() + x * '1 second'::interval as time,
         random() as junk
  from generate_series(1,15000000) x;

(I've intentionally set up this test data so that "driver_id" is randomly
ordered while the "time" column is perfectly in table order. While I
don't have direct evidence, it seems pretty plausible that the original
problem table might be somewhat like that.)
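
As a cross-check on the distribution: log() here is base 10, and the ::int
cast rounds to nearest, so a row lands on driver_id = 1 exactly when
-10*log(u) falls in [0.5, 1.5). Working out the expected fraction (this
arithmetic is mine, not part of the test case) lines up nicely with the
rows=2750660 estimate in the plans below:

select power(10, -0.05) - power(10, -0.15) as frac_driver_id_1;  -- ~0.183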

set maintenance_work_mem = '100MB';
set default_statistics_target = 10000;
vacuum analyze updates;
create index on updates(driver_id, time);
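
To verify the ordering claims above, the correlation stats the planner
will rely on can be inspected directly once the ANALYZE has run; expect
something close to 1.0 for "time" and close to zero for driver_id:

select attname, correlation
from pg_stats
where tablename = 'updates' and attname in ('driver_id', 'time');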

explain SELECT * FROM updates WHERE driver_id=1 ORDER BY "time" DESC LIMIT 1;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..0.73 rows=1 width=20)
   ->  Index Scan Backward using updates_driver_id_time_idx on updates  (cost=0.56..470495.61 rows=2750660 width=20)
         Index Cond: (driver_id = 1)
(3 rows)

That's all good, but if we create this less-well-adapted index, the
planner prefers it:

create index on updates(time);

explain SELECT * FROM updates WHERE driver_id=1 ORDER BY "time" DESC LIMIT 1;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..0.62 rows=1 width=20)
   ->  Index Scan Backward using updates_time_idx on updates  (cost=0.43..522569.43 rows=2750660 width=20)
         Filter: (driver_id = 1)
(3 rows)

Why did that happen? The filter can be expected to reject about four out
of every five tuples, so this plan should be expected to cost about five
times as much to get the first row. (If the driver_id values aren't
randomly located, things could be far worse; but the point here is that
even under the planner's assumption of uncorrelated row order, this should
not be considered a preferable plan.)
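
Making "about five times" concrete, using the planner's own numbers:

select 15000000.0 / 2750660 as tuples_per_surviving_row;  -- ~5.45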

I think the reason is that, because the planner knows that using this
index will require a full-index scan, it's costing the updates_time_idx
plan on the basis of a full scan. Moreover, that index is perfectly
correlated with the physical table order, so it gets a low cost per fetch.
Evidently, the cost is coming out at about 522569/15000000 or 0.035 cost
units per tuple fetched. Meanwhile, the better plan is being costed with
the knowledge that it fetches only one-fifth of the table; and we also see
that that index has zero ordering correlation with the table. So the cost
per tuple fetched is coming out at about 0.171, which is enough more that
even having to fetch five tuples instead of one makes the poorer plan look
cheaper.
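
Spelling out that arithmetic against the two EXPLAIN outputs above:

select 522569.43 / 15000000 as time_idx_per_tuple,       -- ~0.035
       470495.61 / 2750660  as driver_id_idx_per_tuple;  -- ~0.171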

This is, of course, nonsense; we're costing the plans on the basis of
fetching many more tuples than will actually happen, and the cost
amortization effects that cost_index is accounting for will not happen.
So in short, we're taking a fundamentally nonlinear cost estimate from
cost_index and then scaling it linearly to try to account for the LIMIT.
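
Concretely, the Limit node's cost above is (near enough) a straight linear
interpolation between the input path's startup and total costs, which
reproduces both the 0.62 and the 0.73:

-- limit_cost ~ startup + (total - startup) * n / rows, here with n = 1
select 0.43 + (522569.43 - 0.43) / 2750660 as time_idx_limit,       -- ~0.62
       0.56 + (470495.61 - 0.56) / 2750660 as driver_id_idx_limit;  -- ~0.73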

I've known that this was a theoretical hazard of the way we do LIMIT
costing for some time (I recall bitching about it in my 2011 PGCon talk).
But this is the first case I've seen where it results in a wrong choice
in a relatively simple query.

I don't have a concrete proposal right now about how to fix this. The
most expansive response would be to decorate every path with an explicitly
nonlinear cost function, which would need to be able to report the cost
to fetch the first N tuples rather than the total scan cost. That would
be a lot of work though.

Maybe we could make this work just by setting the startup cost of an
indexscan differently. We could redefine "startup cost" as "cost to fetch
the first tuple and then stop", and compute that independently of the
total cost for plan types where it'd matter. That would fix the problem
for LIMIT 1 cases, and even for cases with a larger limit, scaling
linearly from that to the total cost would produce a better estimate than
what we get now. But it feels like it's still a band-aid.
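
For rough illustration only (taking random_page_cost = 4.0 as a crude
stand-in for one non-amortized fetch): under that redefinition, the
updates_time_idx path would have to charge about 5.45 fetches at something
like the non-amortized rate before yielding its first row, rather than the
~0.19 the linear scaling currently credits it with:

select 0.43 + (15000000.0 / 2750660) * 4.0 as first_tuple_cost;  -- ~22.2

and 22.2 versus 0.73 would get the index choice right.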

Thoughts?

regards, tom lane
