Re: [HACKERS] What about LIMIT in SELECT ?

From: jwieck(at)debis(dot)com (Jan Wieck)
To: eric(at)linux-hw(dot)com (Eric Lee Green)
Cc: jeff(at)remapcorp(dot)com, hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] What about LIMIT in SELECT ?
Date: 1998-10-14 11:09:21
Message-ID: m0zTOnx-000EBRC@orion.SAPserv.Hamburg.dsh.de
Lists: pgsql-docs pgsql-hackers

Eric Lee Green wrote:
>
> On Tue, 13 Oct 1998, Jeff Hoffmann wrote:
> > >I agree completely, LIMIT would be VERY useful in web-based apps, which
> > >is all I run. It does not matter to me if it is not part of a formal
> > >standard. The idea is so common that it is a de facto standard.
> >
> > I'm not familiar with mysql and using "LIMIT", but wouldn't this same effect
> > be achieved by declaring a cursor and fetching however many records in the
> > cursor? It's a very noticeable improvement when you only want the first 20
> > out of 500 in a 200k record database, at least.
>
> The problem with declaring a cursor vs. the "LIMIT" clause is that the
> "LIMIT" clause, if used properly by the database engine (along with the
> database engine using indexes in "ORDER BY" clauses) allows the database
> engine to short-circuit the tail end of the query. That is, if you have 25
> names and the last one ends with BEAVIS, the database engine doesn't have
> to go through the BUTTHEADS and KENNYs and etc.
>
> Theoretically a cursor is superior to the "LIMIT" clause because you're
> eventually going to want the B's and K's and etc. anyhow -- but only in a
> stateful environment. In the stateless web environment, a cursor is
> useless because the connection can close at any time even when you're
> using "persistent" connections (and of course when the connection closes
> the cursor closes).

I'm missing something. Well, it's right that in the stateless
web environment a cursor has to be declared and closed on
every single CGI call. But even with a LIMIT clause,
your CGI script must know which key to start at.

So your query must look like

SELECT ... WHERE key > 'last processed key' ORDER BY key;

And your key must be unique (or at least contain no duplicate
entries) or you might miss some rows between the pages (have
100 Browns in the table while the last processed key was a
Brown, and the 'key >' qualification will skip the rest of
them).
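
One hedged sketch of a way around duplicates (the column
names and the tie-breaker column id are made up for
illustration): extend both the qualification and the ORDER BY
by a unique column, so the combination becomes unique again:

SELECT * FROM tab
    WHERE name > 'Brown'
       OR (name = 'Brown' AND id > 4711)
    ORDER BY name, id;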

In Postgres you could actually do the following (but read on
below - it isn't optimized correctly):

BEGIN;
DECLARE c CURSOR FOR SELECT ... WHERE key > 'last' ORDER BY key;
FETCH 20 IN c;
(process the 20 rows in CGI)
CLOSE c;
COMMIT;

Having LIMIT looks more elegant and has less overhead in the
CGI-to-backend communication. But the cursor version is SQL
standard and portable.
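
For comparison, this is what the LIMIT version would collapse
the whole exchange into (MySQL-style syntax - PostgreSQL does
not have it at this point, so this is just a sketch):

SELECT * FROM tab
    WHERE key > 'last processed key'
    ORDER BY key
    LIMIT 20;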

>
> I wanted very badly to use PostgreSQL for a web project I'm working on,
> but it just wouldn't do the job :-(.

I've done some tests, and what I found might be a bug in
PostgreSQL's query optimizer. I have a table with 25k rows
where key is a text field with a unique index. Now I used
EXPLAIN on some queries:

SELECT * FROM tab;

results in a seqscan - expected.

SELECT * FROM tab ORDER BY key;

results in a sort->seqscan - I would have
expected an indexscan!

SELECT * FROM tab WHERE key > 'G';

results in an indexscan - expected.

SELECT * FROM tab WHERE key > 'G' ORDER BY key;

results in a sort->indexscan - hmmm.
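
For anyone who wants to reproduce this, a minimal sketch (the
table and index names are made up; it's the plan shape that
matters, not the exact EXPLAIN output format):

EXPLAIN SELECT * FROM tab WHERE key > 'G' ORDER BY key;
-- plan reported:  Sort -> Index Scan using tab_key_idx on tab
-- plan expected:  Index Scan using tab_key_idx on tab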

These results stay the same even if I blow up the table to
100k rows by duplicating all rows (now with a non-unique
index) and store them presorted in the table.

Needless to say, everything is vacuum'd for statistics.

The last one is the query we would need in the web
environment, used over a cursor as in the example above. But
due to the sort, the backend scans to the end of the table,
sorts all the rows, and only then returns the first 20 (out
of the sort's result).

This is very painful if the qualification (key > ...) points
to the beginning of the key list.

Looking at planner.c I can see that if there is a sortClause
in the parsetree, the planner creates a sort node and does
not check at all whether there is an index that could be used
to do it. In the examples above, the sort is absolutely
needless because the index scan will already return the
tuples in the right order :-).

Somewhere deep in my brain I found a statement that sorting
already-sorted data isn't only unnecessary (unless the order
changes); it is also slow compared to sorting randomly
ordered data.

Can we fix this before the 6.4 release, will it be a post-6.4
item, or am I doing something wrong here? I think it isn't a
fix (it's a planner enhancement), so it should really be a
post-6.4 item.

For now, the only possibility is to omit the ORDER BY from
the query and hope the planner will always generate an index
scan (because of the qualification key > ...). Doing so, I
selected 20 rows multiple times (with the last-key qual, like
a CGI would do) in separate transactions. Using cursor and
fetch speeds up the access by a factor of 1000! But it is
unsafe and thus NOT RECOMMENDED! It's only a test of whether
cursors can do the LIMIT job - and they could, if the planner
did a better job.
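
For the record, a sketch of that unsafe test (names made up
again; it silently relies on the planner choosing the index
scan, which nothing guarantees):

BEGIN;
-- No ORDER BY: we merely hope the 'key >' qual makes the
-- planner pick the index scan, which happens to return the
-- tuples in key order anyway.
DECLARE c CURSOR FOR SELECT * FROM tab
    WHERE key > 'last processed key';
FETCH 20 IN c;
CLOSE c;
COMMIT;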

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================= jwieck(at)debis(dot)com (Jan Wieck) #
