LIMIT in DECLARE CURSOR: request for comments

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: LIMIT in DECLARE CURSOR: request for comments
Date: 2000-10-27 16:18:51
Message-ID: 29402.972663531@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hiroshi and I had a discussion last night that needs to reach a wider
audience than just the bystanders on pgsql-committers. Let me see if
I can reconstruct the main points.

In 7.0, a LIMIT clause can appear in a DECLARE CURSOR, but it's ignored:

play=> select * from vv1;
f1
-------------
0
123456
-123456
2147483647
-2147483647
0
(6 rows)

play=> begin;
BEGIN
play=> declare c cursor for select * from vv1 limit 2;
SELECT
play=> fetch 10 from c;
f1
-------------
0
123456
-123456
2147483647
-2147483647
0
(6 rows)

The reason for this behavior is that LIMIT and the FETCH count are
implemented by the same mechanism (ExecutorRun's count parameter)
and so FETCH has no choice but to override the LIMIT with its own
argument.

Yesterday I reimplemented LIMIT as a separate plan node type, in order
to make it work in views. A side effect of this is that ExecutorRun's
count parameter is now *only* used for FETCH, and therefore a LIMIT
appearing in a DECLARE CURSOR does what IMHO it should do: you get
that many rows and no more from the cursor.

regression=# begin;
BEGIN
regression=# declare c cursor for select * from vv1 limit 2;
SELECT
regression=# fetch 10 from c;
f1
--------
0
123456
(2 rows)

Hiroshi was a little concerned about this change in behavior, and
so the first order of business is whether anyone wants to defend the
old way? IMHO it was incontrovertibly a bug, but ...

The second question is how the presence of a LIMIT clause ought to
affect the planner's behavior. In 7.0, we taught the planner to
pay attention to LIMIT as an indicator whether it ought to prefer
fast-start plans over lowest-total-cost plans. For example, consider

SELECT * FROM tab ORDER BY col;

and assume there's a b-tree index on col. Then the planner has two
possible choices of plan: an indexscan on col, or a sequential scan
followed by sort. The indexscan will begin delivering tuples right
away, whereas the sort has to finish the sequential scan and perform
the sort before it can deliver the first tuple. OTOH the total cost
to deliver the entire result is likely to be less for the sort plan
(let's assume for this discussion that it is). So for the above
query the planner should and will choose the sort plan. But for

SELECT * FROM tab ORDER BY col LIMIT 1;

it will choose the indexscan plan because of the low startup cost.
This is implemented by pricing a query that uses LIMIT on the basis
of linear interpolation between the startup and total costs, with the
interpolation point determined by the fraction of tuples we expect to
retrieve.

This is all pretty clear and seems to work OK for stand-alone SELECT.
But what about a DECLARE CURSOR? The planner has no way to know how
much of the cursor's result will actually be FETCHed by the user, so
it's not clear how to use all this shiny new LIMIT planning mechanism
for a DECLARE CURSOR.

What happens in 7.0 and current code is that for a DECLARE CURSOR,
the planner ignores any LIMIT clause and arbitrarily assumes that the
user will FETCH about 10% of the available data. Hence, the planning
is done on the basis of least "startup + 0.10*(total - startup)" cost.

Ignoring the limit clause was correct in 7.0, given the fact that the
limit wouldn't actually be used at runtime, but it's wrong now (unless
I'm beaten down on the semantics change). Also, the 10% estimate is
the sort of compromise that's likely to satisfy nobody --- if you intend
to fetch all the data, quite likely you want the least total cost,
whereas if you only want the first few rows, you probably want a plan
biased even more heavily towards startup cost at the expense of total
cost.

After thinking some more about yesterday's discussions, I propose that
we adopt the following planning behavior for cursors:

1. If DECLARE CURSOR does not contain a LIMIT, continue to plan on the
basis of 10%-or-so fetch (I'd consider anywhere from 5% to 25% to be
just as reasonable, if people want to argue about the exact number;
perhaps a SET variable is in order?). 10% seems to be a reasonable
compromise between delivering tuples promptly and not choosing a plan
that will take forever if the user fetches the whole result.

2. If DECLARE CURSOR contains a specific "LIMIT n" clause, plan on
the assumption that n tuples will be fetched. For small n this allows
the user to heavily bias the plan towards fast start. Since the LIMIT
will actually be enforced by the executor, the user cannot bias the
plan more heavily than is justified by the number of tuples he's
intending to fetch, however.

3. If DECLARE CURSOR contains "LIMIT ALL", plan on the assumption that
all tuples will be fetched, ie, select lowest-total-cost plan.

(Note: LIMIT ALL has been in the grammar right along, but up to now
it has been entirely equivalent to leaving out the LIMIT clause. This
proposal essentially suggests allowing it to act as a planner hint that
the user really does intend to fetch all the tuples.)

Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Lamar Owen 2000-10-27 16:34:12 Re: [GENERAL] 7.0 vs. 7.1 (was: latest version?)
Previous Message Larry Rosenman 2000-10-27 16:13:52 Re: Summary: what to do about INET/CIDR