Proposing WITH ITERATIVE

From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Proposing WITH ITERATIVE
Date: 2020-04-27 16:52:48
Message-ID: CADUqk8UpPEoYnGXX3_DMKyBM=t2gv0jkWQWct1mKVDv72GNN5Q@mail.gmail.com
Lists: pgsql-hackers

Hey, everyone.

It's been quite a while since I last contributed a patch but, as this is a
new feature, I wanted to gather feedback before doing so. I've found this
functionality, already in use at Universität Tübingen, to be both highly
beneficial in many of my queries and a small change to Postgres - a good
trade-off IMO.

As you know, Postgres currently supports SQL:1999 recursive common table
expressions, constructed using WITH RECURSIVE, which permit the computation
of growing relations (e.g., transitive closures). Although it is possible
to use this recursion for general-purpose iterations, doing so is a
deviation from its intended use case. Accordingly, an iterative-only use of
WITH RECURSIVE often results in sizable overhead and poor optimizer
decisions. In this email, I'd like to propose a similar but
iterative-optimized form of CTE - WITH ITERATIVE.
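
For concreteness, here's a toy example of my own (not taken from the
patch) that uses WITH RECURSIVE purely iteratively - a Newton iteration
for sqrt(2). Every iteration's row is appended to the working relation,
and an explicit counter column is needed both to bound the loop and to
pick out the final row:

  WITH RECURSIVE newton (i, x) AS (
      SELECT 1, 1.0::float8                    -- iteration counter, initial guess
    UNION ALL
      SELECT i + 1, x - (x * x - 2) / (2 * x)  -- one Newton step
      FROM newton
      WHERE i < 10                             -- run 10 iterations
  )
  SELECT x FROM newton WHERE i = 10;           -- only the last row is of interest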

In that it can reference its own relation within its definition, this
iterative variant has capabilities similar to those of recursive CTEs. In
contrast to recursive CTEs, however, rather than appending new tuples, this
variant replaces the intermediate relation on each iteration. As such, the
final result consists only of the tuples computed during the last
iteration. Why would we want this?

This type of computation pattern is often found in data- and graph-mining
algorithms. In PageRank, for example, the node ranks are initialized once
and then updated in each iteration. In clustering algorithms, the
assignment of data tuples to clusters is updated in every iteration. What
these algorithms have in common is that they operate on fixed-size
datasets in which only specific values (ranks, cluster assignments, etc.)
are updated.
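
For comparison, here is what the Newton example above might look like in
the iterative form. The syntax is purely a sketch of my own - in
particular, how and where the user-defined stop predicate is spelled is an
assumption, not the patch's actual grammar:

  WITH ITERATIVE newton (x) AS (          -- hypothetical syntax
      SELECT 1.0::float8                  -- initial guess
    UNION ALL
      SELECT x - (x * x - 2) / (2 * x)    -- one Newton step, replacing the row
      FROM newton
      WHERE abs(x * x - 2) > 1e-12        -- iterate while not converged (spelling assumed)
  )
  SELECT x FROM newton;                   -- holds only the final iteration's result

No iteration counter has to be carried along and no final filter is
needed; the working relation always holds just the current iteration.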

Rather than stopping when no new tuples are generated, WITH ITERATIVE stops
when a user-defined predicate evaluates to true. By providing a
non-appending iteration concept with a while-loop-style stop criterion, we
can massively speed up query execution due to better optimizer decisions.
Although it is possible to implement the types of algorithms mentioned
above using recursive CTEs, this feature has two significant advantages:

First, iterative CTEs consume significantly less memory. Consider a
working relation of N tuples iterated I times. With a recursive CTE, the
result relation grows to (N * I) tuples. With an iterative CTE, however,
prior iterations are discarded early, so the relation stays at N tuples,
with at most (2 * N) tuples needed to compare the current iteration
against the prior one. Furthermore, in queries where the number of
executed iterations is the stop criterion, recursive CTEs incur additional
per-tuple overhead to carry the iteration counter along.
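
As a purely hypothetical data point: a 1-million-tuple working relation
iterated 30 times materializes roughly 30 million tuples under WITH
RECURSIVE (each also carrying a counter column when the iteration count is
the stop criterion), versus at most about 2 million tuples at any time
under the iterative form.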

Second, iterative CTEs have lower query response times. Because the
intermediate relation stays small, the time to scan and process it in each
iteration - re-computing ranks, cluster assignments, etc. - is
significantly reduced.

In short, iterative CTEs retain the flexibility of recursive CTEs but
offer a significant advantage for algorithms that don't require the
results computed in every iteration. As this feature deviates only
slightly from WITH RECURSIVE, very little code is required to support it
(~250 LOC).

If there's any interest, I'll clean up the Tübingen patch and submit it.
Thoughts?

--
Jonah H. Harris
