Re: Implementing Incremental View Maintenance

From: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "r(dot)takahashi_2(at)fujitsu(dot)com" <r(dot)takahashi_2(at)fujitsu(dot)com>, 'Zhihong Yu' <zyu(at)yugabyte(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Implementing Incremental View Maintenance
Date: 2021-11-29 05:48:26
Message-ID: 20211129144826.c9d42c1f61495c6983d8b6b1@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

This is a response to a comment in "Commitfest 2021-11 Patch Triage - Part 1" [1].

> 2138: Incremental Materialized View Maintenance
> ===============================================
> There seems to be concensus on the thread that this is a feature that we want,
> and after initial design discussions there seems to be no disagreements with
> the approach taken. The patch was marked ready for committer almost a year
> ago, but have since been needs review (which seems correct). The size of the
> patchset and the length of the thread make it hard to gauge just far away it
> is, maybe the author or a review can summarize the current state and outline
> what is left for it to be committable.

[1] https://www.postgresql.org/message-id/6EDAAF93-1663-41D0-9148-76739104943E%40yesql.se

I'll describe recent discussions and current status of this thread.

* Recent Discussions and Current status

1.
Previously, we proposed a patchset that supports outer-joins, some sub-queries
and CTEs. However, aiming to reduce the size of the patchset, I proposed to omit
these features from the first version of the patch in my post at 2021-08-02 [2].

Currently, we are proposing Incremental View Maintenance feature for PostgreSQL 15
that supports following queries in the view definition query.

- inner joins including self-join
- DISTINCT and views with tuple duplicates
- some built-in aggregate functions (count, sum, agv, min, and max)

Is it OK? Although there has been no opposite opinion, we want to confirm it.

[2] https://www.postgresql.org/message-id/20210802152834.ecbaba6e17d1957547c3a55d%40sraoss.co.jp

2.
Recently, There was a suggestion that we should support partitioned tables from
Ryohei Takahashi, but I decided to not support it in the first release of IVM.
Takahshi-san agreed with it, and the documentation will be fixed soon [3].

[3] https://www.postgresql.org/message-id/20211125154717.777e9d35ddde5f2e0d5d8355%40sraoss.co.jp

3.
Takahashi-san pointed out that restoring pg_dump results causes an error. I am fixing
it now.[4]

[4] https://www.postgresql.org/message-id/20211125163710.2f32ae3d4be5d5f9ade020b6%40sraoss.co.jp

The remaining is the summary of our proposal of IVM feature, its design, and past discussions.

---------------------------------------------------------------------------------------
* Features

Incremental View Maintenance (IVM) is a way to make materialized views
up-to-date by computing only incremental changes and applying them on
views. IVM is more efficient than REFRESH MATERIALIZED VIEW when
only small parts of the view are changed.

This patchset provides a feature that allows materialized views to be
updated automatically and incrementally just after a underlying table
is modified.

You can create an incementally maintainable materialized view (IMMV)
by using CREATE INCREMENTAL MATERIALIZED VIEW command.

The followings are supported in view definition queries:
- SELECT ... FROM ... WHERE ..., joins (inner joins, self-joins)
- some built-in aggregate functions (count, sum, avg, min, max)
- GROUP BY clause
- DISTINCT clause

Views can contain multiple tuples with the same content (duplicate tuples).

The following are not supported in a view definition:
- Outer joins
- Aggregates otehr than above, window functions, HAVING
- Sub-queries, CTEs
- Set operations (UNION, INTERSECT, EXCEPT)
- DISTINCT ON, ORDER BY, LIMIT, OFFSET

Also, a view definition query cannot contain other views, materialized views,
foreign tables, partitioned tables, partitions, VALUES, non-immutable functions,
system columns, or expressions that contains aggregates.

---------------------------------------------------------------------------------------
* Design

An IMMV is maintained using statement-level AFTER triggers. When an IMMV is
created, triggers are automatically created on all base tables contained in the
view definition query.

When a table is modified, the change that occurred in the table are extracted
as transition tables in the AFTER triggers. Then, changes that will occur in
the view are calculated by a rewritten view dequery in which the modified table
is replaced with the transition table. For example, if the view is defined as
"SELECT * FROM R, S", and tuples inserted into R are stored in a transiton table
dR, the tuples that will be inserted into the view are calculated as the result
of "SELECT * FROM dR, S".

** Multiple Tables Modification

Multiple tables can be modified in a statement when using triggers, foreign key
constraint, or modifying CTEs. When multiple tables are modified, we need
the state of tables before the modification. For example, when some tuples,
dR and dS, are inserted into R and S respectively, the tuples that will be
inserted into the view are calculated by the following two queries:

"SELECT * FROM dR, S_pre"
"SELECT * FROM R, dS"

where S_pre is the table before the modification, R is the current state of
table, that is, after the modification. This pre-update states of table
is calculated by filtering inserted tuples using cmin/xmin system columns,
and appending deleted tuples which are contained in the old transition table.
This is implemented in get_prestate_rte().

Transition tables for each modification are collected in each AFTER trigger
function call. Then, the view maintenance is performed in the last call of
the trigger.

In the original PostgreSQL, tuplestores of transition tables are freed at the
end of each nested query. However, their lifespan needs to be prolonged to
the end of the out-most query in order to maintain the view in the last AFTER
trigger. For this purpose, SetTransitionTablePreserved is added in trigger.c.

** Duplicate Tulpes

When calculating changes that will occur in the view (= delta tables),
multiplicity of tuples are calculated by using count(*).

When deleting tuples from the view, tuples to be deleted are identified by
joining the delta table with the view, and the tuples are deleted as many as
specified multiplicity by numbered using row_number() function.
This is implemented in apply_old_delta().

When inserting tuples into the view, tuples are duplicated to the specified
multiplicity using generate_series() function. This is implemented in
apply_new_delta().

** DISTINCT clause

When DISTINCT is used, the view has a hidden column __ivm_count__ that
stores multiplicity for tuples. When tuples are deleted from or inserted into
the view, the values of __ivm_count__ column is decreased or increased as many
as specified multiplicity. Eventually, when the values becomes zero, the
corresponding tuple is deleted from the view. This is implemented in
apply_old_delta_with_count() and apply_new_delta_with_count().

** Aggregates

Built-in count sum, avg, min, and max are supported. Whether a given
aggregate function can be used or not is checked by using its OID in
check_aggregate_supports_ivm().

When creating a materialized view containing aggregates, in addition
to __ivm_count__, more than one hidden columns for each aggregate are
added to the target list. For example, columns for storing sum(x),
count(x) are added if we have avg(x). When the view is maintained,
aggregated values are updated using these hidden columns, also hidden
columns are updated at the same time.

The maintenance of aggregated view is performed in
apply_old_delta_with_count() and apply_new_delta_with_count(). The SET
clauses for updating columns are generated by append_set_clause_*().

If the view has min(x) or max(x) and the minimum or maximal value is
deleted from a table, we need to update the value to the new min/max
recalculated from the tables rather than incremental computation. This
is performed in recalc_and_set_values().

---------------------------------------------------------------------------------------
* Discussion

** Aggregate support

There were a few suggestions that general aggregate functions should be
supported [5][6], which may be possible by extending pg_aggregate catalog.
However, we decided to left supporting general aggregates to the future work [7]
because it would need substantial works and make the patch more complex and
bigger. There has been no opposite opinion on this.

[5] https://www.postgresql.org/message-id/20191128140333.GA25947%40alvherre.pgsql
[6] https://www.postgresql.org/message-id/CAM-w4HOvDrL4ou6m%3D592zUiKGVzTcOpNj-d_cJqzL00fdsS5kg%40mail.gmail.com
[7] https://www.postgresql.org/message-id/20201016193034.9a4c44c79fc1eca7babe093e%40sraoss.co.jp

** Hidden columns

Columns starting with "__ivm_" are hidden columns that doesn't appear when a
view is accessed by "SELECT * FROM ....". For this aim, parse_relation.c is
fixed. There was a proposal to enable hidden columns by adding a new flag to
pg_attribute [8], but this thread is no longer active, so we decided to check
the hidden column by its name [9].

[8] https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com
[9] https://www.postgresql.org/message-id/20201016193034.9a4c44c79fc1eca7babe093e%40sraoss.co.jp

** Concurrent Transactions

When the view definition has more than one table, we acquire an exclusive
lock before the view maintenance in order to avoid inconsistent results.
This behavior was explained in [10]. The lock was improved to use weaker lock
when the view has only one table based on a suggestion from Konstantin Knizhnik [11].

[10] https://www.postgresql.org/message-id/20200909092752.c91758a1bec3479668e82643%40sraoss.co.jp
[11] https://www.postgresql.org/message-id/5663f5f0-48af-686c-bf3c-62d279567e2a%40postgrespro.ru

** Automatic Index Creation

When a view is created, a unique index is automatically created if
possible, that is, if the view definition query has a GROUP BY or
DISTINCT, or if the view contains all primary key attributes of
its base tables in the target list. It is necessary for efficient
view maintenance. This feature is based on a suggestion from
Konstantin Knizhnik [12].

[12] https://www.postgresql.org/message-id/89729da8-9042-7ea0-95af-e415df6da14d%40postgrespro.ru

** Others

There are some other changes in core for IVM implementation.
There has been no opposite opinion on any ever.

- syntax

The command to create an incrementally maintainable materialized
view (IMMV) is "CREATE INCREMENTAL MATERIALIZED VIEW". The new
keyword "INCREMENTAL" is added.

- pg_class

A new attribue "relisivm" is added to pg_class to indicate
that the relation is an IMMV.

- deptype

DEPENDENCY_IMMV(m) was added to pg_depend as a new deptype. This is necessary
to clear that a certain trigger is related to IMMV, especially when We dropped
IVM triggers from the view when REFRESH ... WITH NO DATA is executed [13].

[13] https://www.postgresql.org/message-id/20210922185343.548883e81b8baef14a0193c5%40sraoss.co.jp

---------------------------------------------------------------------------------------

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-11-29 05:56:01 Re: Non-superuser subscription owners
Previous Message Dilip Kumar 2021-11-29 05:44:23 Re: Synchronizing slots from primary to standby