Re: FETCH FIRST clause PERCENT option

From: Mark Dilger <hornschnorter(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Surafel Temesgen <surafel3000(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FETCH FIRST clause PERCENT option
Date: 2018-09-21 15:28:53
Message-ID: C33C2805-2664-4687-B554-3372157F0A59@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Sep 20, 2018, at 5:29 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2018-09-20 17:06:36 -0700, Mark Dilger wrote:
>> I should think that spilling anything to a tuplestore would only be needed
>> if the query contains an ORDER BY expression. If you query
>>
>> FETCH FIRST 50 PERCENT * FROM foo;
>>
>> you should just return every other row, discarding the rest, right? It's
>> only when an explicit ordering is given that the need to store the results
>> arises. Even with
>>
>> FETCH FIRST 50 PERCENT name FROM foo ORDER BY name;
>>
>> you can return one row for every two rows that you get back from the
>> sort node, reducing the maximum number you need to store at any time to
>> no more than 25% of all rows.
>
> I'm doubtful about the validity of these optimizations, particularly
> around being surprising. But I think more importantly, we should focus
> on the basic implementation that's needed anyway.

You may be right that getting the basic implementation finished first
is better than optimizing at this stage. So the rest of what I'm going
to say is just in defense of the optimization, and not an argument for
needing to optimize right away.

As for reducing the surprise factor, I think that it would be surprising
if I ask for a smallish percentage of rows and it takes significantly longer
and significantly more memory or disk than asking for all the rows takes.
If I'm including an explicit ORDER BY, then that explains it, but otherwise,
I'd be surprised. Note that I'm not saying I'd be surprised by it taking
roughly the same length of time / memory / disk. I'd only be surprised if
it took a lot more.

There are plenty of SQL generation engines that people put in their software.
I'd expect something like

sprintf("FETCH FIRST %d PERCENT %s FROM %s", percentage, columns, tablename)

to show up in such engines, and percentage to sometimes be 100. At least
in that case you should just return all rows rather than dumping them into
a tuplestore. Likewise, if the percentage is 0, you'll want to finish quickly.
Actually, I don't know if the SQL spec would require side effects to still
happen, in which case you'd still have to generate all rows for their side
effects to happen, and then just not return them. But still, no tuplestore.
So the implementation of FETCH FIRST would at least need to think about what
percentage is being requested, rather than just mindlessly adding a node to
the tree for storing everything, then computing the LIMIT based on the number
of rows stored, and then returning that number of rows.

mark

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Gilles Darold 2018-09-21 15:44:02 [patch] Bug in pg_dump/pg_restore using --no-publication
Previous Message Hubert Zhang 2018-09-21 14:21:16 Re: Proposal for disk quota feature