[WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR

From: Dent John <denty(at)QQdd(dot)eu>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Roman Pekar <roma(dot)pekar(at)gmail(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: [WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR
Date: 2019-09-17 20:06:08
Message-ID: B2AFCAB5-FACD-44BF-963F-7DD2735FAB5D@QQdd.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi folks,

Prompted originally by a post by Roman Pekar [1], I wanted to share a revised version of a patch that allows REFCURSOR results to be consumed as data in a regular SQL query as well as my thoughts on how to improve the area as a whole.

In order to be clear about the purpose and how I see it fitting into a broader context, I’ve started a new thread and I’d welcome discussion about it.

Background
----------

The ambition of this contribution is to make PostgreSQL able to efficiently support procedural language functions that either produce or consume sets (or both).

PostgreSQL already has some support for this in functions that return SETOFs. However, as my review [3] identified, there are some gaps in PostgreSQL’s current capability, as well as scope for extension to improve its overall capability.

This first patch addresses only a small part of the overall ambition, but I wanted to share both the patch and the overall ambition as work in progress, and I’d welcome comments on both. (The patch is still based on 12beta2.)

Problems to be solved
---------------------

1. Allow procedural languages (PLs) to efficiently consume sets of records

PostgreSQL does allow PL functions to consume sets, however it does so be feeding records to the function, one row per function invocation. REFCURSORs, however can be supplied as input parameters and their content consumed by the function as it wishes, but only if the PL supports the REFCURSOR concept.

Typically, PLs do allow SQL queries to be executed within the PL function [5, 6, 7]. However REFCURSOR results cannot be effectively consumed in a regular SQL SELECT, significantly limiting their use.

2. Allow plpgsql functions to efficiently return sets of records

By ‘efficiently’, I mean that a large result set should not be required to be staged before the executor is able to process it. Staging is not an issue for small sets, but for large sets and especially if they are subject to further processing, intermediate staging it is a performance disadvantage.

PostgreSQL already has some support for this functions that return SETOFs. At present, plpgsql cannot take advantage of this support while also achieving the efficiency criteria because, as the documentation [4] notes, all returned data is staged before it is retuned.

Addressing this limitation could also of benefit to other PLs, however a quick survey finds at least PL Python is already well-adapted to efficiently return SETOFs.

3. Allow optimisation of a returned query

plpgsql offers a syntactic shortcut to return the results of a SQL query directly. Despite appearing to return a query, the RETURN QUERY syntax actually returns the /results/ of the query [4]. This means the optimiser has no opportunity to influence its execution, such as by pushing down expressions into its WHERE clause, or taking advantage of alternative indexes to modify its sort order.

Other PLs are similarly constrained. Most PLs lack plpgsql’s syntactic sugar, but even though some PLs are better able to efficiently return SETOFs row-by-row, the optimiser cannot see “inside” the query that the PL executes even if its intent is to return the results directly.

Only SQL language functions are afforded the luxury of integration into then outer statement’s plan [8], but even SQL language functions are somewhat constrained in the types of dynamism that are permitted.

4. Allow a set-returning query to be supplied as an input parameter

It is possible to supply a scalar value, or function call that returns a scalar value as an input parameter. And, with problems 1 & 2 addressed, sets can be supplied as input parameters. However, a literal set-returning SQL query cannot be supplied as a parameter (not without PostgreSQL invoking the ‘calling’ function for each row in the set). REFCURSORs cannot be constructed natively from SQL.

A simplistic response would provide a trivial constructor for REFCURSORs, accepting the query as a text parameter. However it is quite unnatural to supply SQL in textual form, more especially to do so safely. So the challenge is to allow a set-returning subquery to be provided as a parameter in literal form.

Why are these problems important?
---------------------------------

My personal wish is for PostgreSQL to offer a feature set that is consistent with itself and without arbitrary limitation. A better argument might be that that it is desirable to match the features of other RDBMSs [9], or for reason of the use cases they address [10] that are new or push the boundaries of what PostgreSQL can do [1], or that are important such as fulfilling a DW/ETL need [11], or to more directly address approaches touted of NoSQL such as Map Reduce [12].

Design and implementation
-------------------------

1. Set returning function (SRF) for REFCURSOR

Tackling first problems 1 and (part of) 2, it seems easy and obvious to allow that REFCURSORs can be consumed in a SELECT query.

PostgreSQL already allows an array to be consumed one record per entry via the UNNEST(anyarray) built-in function [13]. Overloading UNNEST() to accept a REFCURSOR argument can be done easily, and the executor’s SRF machinery allows the result set to be consumed efficiently.

With such an implementation, and given a REFCURSOR-returning function, kv() the following syntax illustrates trivial usage:

SELECT *
FROM UNNEST (kv ('A'))
AS (key TEXT, val NUMERIC);

With this UNNEST() construct, it is possible to consume a returned REFCURSOR inline in a single SQL statement.

To complete the example, the function kv() might trivially be defined as:

CREATE FUNCTION kv (suffix text)
RETURNS REFCURSOR
STABLE LANGUAGE plpgsql
AS $$
DECLARE
cur REFCURSOR;
BEGIN
OPEN cur FOR EXECUTE
'SELECT * FROM kv_table_' || suffix;
RETURN cur;
END;
$$;

Other obvious example setup includes:

create table kv_table_a (key text, value numeric);
insert into kv_table_a select 'ITEM_A', generate_series (0, 99);

It is also possible to accept a REFCURSOR as an input parameter:

CREATE FUNCTION limit_val (input_refcur text, val_limit numeric)
RETURNS REFCURSOR
STABLE LANGUAGE plpgsql
AS $$
DECLARE
cur REFCURSOR;
BEGIN
OPEN cur FOR SELECT * FROM UNNEST (input_refcur::REFCURSOR) as (key text, value numeric) WHERE value < val_limit;
RETURN cur;
END;
$$;

SELECT *
FROM UNNEST (limit_val (kv ('A')::text, 10))
AS (key TEXT, val NUMERIC);

Having this construct, it is possible for plpgsql FUNCTION’s to both accept and return REFCURSOR variables. In plpgsql, is would be unnecessary to cast the REFCURSOR to and from text, but other PLs, presumably lacking first class knowledge of the REFCURSOR type, probably need to do so. In above example, limit_val() illustrates how a REFCURSOR can be accepted in text form.

In my patch, I’ve used the SPI APIs to access the Portal which lies behind the REFCURSOR. Although SPI seems to offer an easy interface, and it’s also what plpgsql uses internally, I’m not sure it wouldn’t be better to access the Portal directly.

It is interesting to note that Oracle names its similar construct TABLE() [9], rather than UNNEST(), and in later releases, its use is optional. TABLE is a reserved word and it would be unusual to overload it, although we could educate the parser to treat it specially. Oracle compatibility is an important consideration, but this is a niche area.

If we continue to use REFCURSOR, it is difficult to make some function call-like construct optional because it is already syntactically possible to select FROM a REFCURSOR-returning function. For example, SELECT * FROM kv (‘A’), is a valid and effective expression, despite being of questionable construction and utility.

An alternative might build on top of existing support for returning SETOFs, which already requires no UNNEST()-like construct. This is attractive in principle, but it makes some of the further extensions discussed below more awkward (in my opinion).

2. Query-bound REFCURSORs

Problem 3 could be addressed by educating the planner in how to extract the query inside the Portal behind the REFCURSOR.

At present, REFCURSORs are only bound to a query once they are OPENed, but when they are OPENed, the query also is fully planned, ready for immediate execution. The ambition is to allow the REFCURSOR’s query to be inlined within the outer query’s plan, so it seems wasteful to expend planner cycles, only for the plan to be thrown away.

The proposed implementation would (1) create an intermediate BOUND state for REFCURSORs, and (2) educate plpgsql about how to BIND unbound_cursorvar FOR query.

My first idea was to modify the REFCURSOR type itself, creating a new state, and adding storage for the BOUND query, but this seems unfeasible without extensive hackery. The REFCURSOR type is a trivial veneer atop a C string (which contains the Portal name), so there is no internal structure to extend.

So my plan is to retain the direct coupling of REFCURSOR<->Portal, and to allow plpgsql to set the query text at BIND time via PortalDefineQuery(). Existing plpgsql code should be unaffected as it need know nothing about the new BOUND state.

In order for any of this to work, the planner has to be able to extract the query from the returned Portal. It seems inline_set_returning_function() is the right place to make this extraction. Adding specific recognition for a function call to UNNEST() with single argument of type REFCURSOR is easy, and existing eval_const_expressions() semantics determine whether the single argument expression can be evaluated at plan time. (Of course, if it cannot, then it falls through to be processed at execution time by the REFCURSOR set returning function (SRF) described above.)

It feels slightly uncomfortable to have UNNEST(REFCURSOR) operate as a regular function, and also have specific recognition for UNNEST() elsewhere in the planner machinery. Arguably, this is already a kind of specific knowledge that inline_set_returning_function() has for SQL language FUNCTIONs, but the recognition I propose for UNNEST(REFCURSOR) is much narrower. An alternative might be to introduce a new type that inline_set_returning_function() can recognise (for example, INLINEABLE_QUERY), or to entirely separate the SRF case from the inlining case at a syntax level (for example, retaining UNNEST(REFCURSOR) as the SRF, but using INLINE(REFCURSOR) for the case at hand).

I’d welcome input here. Although the implementation seems quite feasible, the SQL and plpgsql syntax is less obvious.

3. Literal subquery type

Problem 4 could be addressed by educating the parser specially about the REFCURSOR type when faced with a literal query.

Consider this example:

SELECT *
FROM UNNEST (
limit_val (
REFCURSOR (
SELECT key || '_COPY', value FROM kv_table_a
), 25)
) AS (key TEXT, val NUMERIC);

The REFCURSOR(literal_query) construct could be made to result in a BOUND REFCURSOR, in this case, with SELECT key || '_COPY', value FROM kv_table_a, and then passed as a constant parameter to limit_val().

Usefully, at present, the construct yields a syntax error: although REFCURSOR(string) is an already valid construct (being equivalent to CAST (string AS REFCURSOR)), it’s not acceptable to provide a literal subquery without parenthesis. So, while REFCURSOR ((SELECT 'some_cursor')) is roughly equivalent to CAST ('some_cursor' AS REFCURSOR), the near-alternative of REFCURSOR (SELECT 'some_cursor') is quite simply not valid.

If I’m right, the task is simply a matter of ‘plumbing through’ special knowledge of a REFCURSOR(literal_subquery) construct through the parser. It seems tricky as there are many affected code sites, but the work seems uncomplicated.

Educating the parser about special types seems, again, slightly uncomfortable. An alternative might be to create an intermediate construct: for example, QUERY(literal_subquery) might be made to return the parse tree as a pg_node_tree (similar to how VIEWs are exposed in the system catalogue), and REFCURSOR(pg_node_tree) could consume it, yielding a joined-up construct of REFCURSOR(QUERY(literal_subquery)). However, we might also simply accept REFCURSOR(literal_subquery) to be special, and if/when other need is found for a literal subquery as a parameter, then this becomes the way to supply it.

For point of reference, Oracle seems to have gone the way of making CURSOR(literal_subquery) do something similar, yielding a REF CURSOR, which allows the CURSOR to be passed by reference [2].

Other problems
--------------

1. This contribution does not actually address the limitation in plpgsql, that the “current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function” [4]. My original investigation [3] presumed this limitation to apply generally to all PLs, but I now realise this is not the case: at least the Python PL allows efficient return of SETOFs [5]. With this in mind, I see the plpgsql limitation as less encumbering (as plpython is presumably broadly available) but I’d be interested to know if this view is shared.

2. A perhaps more significant problem is the apparent duplication of plpgsql’s RETURN QUERY syntax. One could perhaps conceive that plpgsql supported an additional marker, for example, RETURN [INLINEABLE] QUERY. It is difficult to see this fitting well with other PLs.

3. The current proposal also requires to declare the expected record with an AS (...) construction. This is rather inconvenient, but it is difficult to see how it could be avoided.

4. Other PLs can use REFCURSORS by virtue of REFCURSOR being a thin veneer atop string. It is coherent if one understands how the PostgreSQL type system works, but quite strange otherwise. Better integration into other PLs and their type systems might be important.

5. As mentioned, SETOF is a near-equivalent for some use cases. There’s no way to cast the results of a function RETURNING SETOF to a REFCURSOR without something like REFCURSOR (SELECT * from <some function>(...)). It might be useful to offer a little syntactic sugar. Perhaps we could invent NEST(SETOF <some RECORD type>) could return REFCURSOR.

References
----------

[1] https://www.postgresql.org/message-id/CAAcdnuzHDnDX73jBb9CZZE%3DSv3gDTk8E6-SGRGYEUZbLAy0QRA%40mail.gmail.com <https://www.postgresql.org/message-id/CAAcdnuzHDnDX73jBb9CZZE=Sv3gDTk8E6-SGRGYEUZbLAy0QRA@mail.gmail.com>
[2] https://docs.oracle.com/database/121/SQLRF/expressions006.htm#SQLRF52077 <https://docs.oracle.com/database/121/SQLRF/expressions006.htm#SQLRF52077>
[3] https://www.postgresql.org/message-id/DE237364-EB7A-4851-9337-F9F6491E46A6%40qqdd.eu <https://www.postgresql.org/message-id/DE237364-EB7A-4851-9337-F9F6491E46A6%40qqdd.eu>
[4] https://www.postgresql.org/docs/10/plpgsql-control-structures.html#PLPGSQL-STATEMENTS-RETURNING <https://www.postgresql.org/docs/10/plpgsql-control-structures.html#PLPGSQL-STATEMENTS-RETURNING>
[5] https://www.postgresql.org/docs/10/plpython-database.html#id-1.8.11.15.3 <https://www.postgresql.org/docs/10/plpython-database.html#id-1.8.11.15.3>
[6] https://www.postgresql.org/docs/10/plpgsql-cursors.html#PLPGSQL-CURSOR-DECLARATIONS <https://www.postgresql.org/docs/10/plpgsql-cursors.html#PLPGSQL-CURSOR-DECLARATIONS>
[7] https://github.com/tada/pljava/wiki/Using-jdbc <https://github.com/tada/pljava/wiki/Using-jdbc>
[8] https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions <https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions>
[9] https://docs.oracle.com/cd/B19306_01/appdev.102/b14289/dcitblfns.htm <https://docs.oracle.com/cd/B19306_01/appdev.102/b14289/dcitblfns.htm>
[10] https://www.postgresql.org/message-id/flat/005701c6dc2c%2449011fc0%240a00a8c0%40trivadis.com <https://www.postgresql.org/message-id/flat/005701c6dc2c%2449011fc0%240a00a8c0%40trivadis.com>
[11] https://oracle-base.com/articles/misc/pipelined-table-functions <https://oracle-base.com/articles/misc/pipelined-table-functions>
[12] https://blogs.oracle.com/datawarehousing/mapreduce-oracle-tablefunctions <https://blogs.oracle.com/datawarehousing/mapreduce-oracle-tablefunctions>
[13] https://www.postgresql.org/docs/10/functions-array.html <https://www.postgresql.org/docs/10/functions-array.html>

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2019-09-17 20:06:16 Re: Small const correctness patch
Previous Message Alvaro Herrera 2019-09-17 19:58:00 Re: [HACKERS] Restricting maximum keep segments by repslots