Re: Incremental View Maintenance, take 2

From: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
To: jian he <jian(dot)universality(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Incremental View Maintenance, take 2
Date: 2023-08-27 13:35:51
Message-ID: 20230827223551.0668853c92d45ec25304187e@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 29 Jun 2023 00:40:45 +0800
jian he <jian(dot)universality(at)gmail(dot)com> wrote:

> On Wed, Jun 28, 2023 at 4:06 PM Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote:
> >
> > On Wed, 28 Jun 2023 00:01:02 +0800
> > jian he <jian(dot)universality(at)gmail(dot)com> wrote:
> >
> > > On Thu, Jun 1, 2023 at 2:47 AM Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote:
> > > >
> > > > On Thu, 1 Jun 2023 23:59:09 +0900
> > > > Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote:
> > > >
> > > > > Hello hackers,
> > > > >
> > > > > Here's a rebased version of the patch-set adding Incremental View
> > > > > Maintenance support for PostgreSQL. That was discussed in [1].
> > > >
> > > > > [1] https://www.postgresql.org/message-id/flat/20181227215726.4d166b4874f8983a641123f5%40sraoss.co.jp
> > > >
> > > > ---------------------------------------------------------------------------------------
> > > > * Overview
> > > >
> > > > Incremental View Maintenance (IVM) is a way to make materialized views
> > > > up-to-date by computing only incremental changes and applying them on
> > > > views. IVM is more efficient than REFRESH MATERIALIZED VIEW when
> > > > only small parts of the view are changed.
> > > >
> > > > ** Feature
> > > >
> > > > The attached patchset provides a feature that allows materialized views
> > > > to be updated automatically and incrementally just after a underlying
> > > > table is modified.
> > > >
> > > > You can create an incementally maintainable materialized view (IMMV)
> > > > by using CREATE INCREMENTAL MATERIALIZED VIEW command.
> > > >
> > > > The followings are supported in view definition queries:
> > > > - SELECT ... FROM ... WHERE ..., joins (inner joins, self-joins)
> > > > - some built-in aggregate functions (count, sum, avg, min, max)
> > > > - GROUP BY clause
> > > > - DISTINCT clause
> > > >
> > > > Views can contain multiple tuples with the same content (duplicate tuples).
> > > >
> > > > ** Restriction
> > > >
> > > > The following are not supported in a view definition:
> > > > - Outer joins
> > > > - Aggregates otehr than above, window functions, HAVING
> > > > - Sub-queries, CTEs
> > > > - Set operations (UNION, INTERSECT, EXCEPT)
> > > > - DISTINCT ON, ORDER BY, LIMIT, OFFSET
> > > >
> > > > Also, a view definition query cannot contain other views, materialized views,
> > > > foreign tables, partitioned tables, partitions, VALUES, non-immutable functions,
> > > > system columns, or expressions that contains aggregates.
> > > >
> > > > ---------------------------------------------------------------------------------------
> > > > * Design
> > > >
> > > > An IMMV is maintained using statement-level AFTER triggers.
> > > > When an IMMV is created, triggers are automatically created on all base
> > > > tables contained in the view definition query.
> > > >
> > > > When a table is modified, changes that occurred in the table are extracted
> > > > as transition tables in the AFTER triggers. Then, changes that will occur in
> > > > the view are calculated by a rewritten view dequery in which the modified table
> > > > is replaced with the transition table.
> > > >
> > > > For example, if the view is defined as "SELECT * FROM R, S", and tuples inserted
> > > > into R are stored in a transiton table dR, the tuples that will be inserted into
> > > > the view are calculated as the result of "SELECT * FROM dR, S".
> > > >
> > > > ** Multiple Tables Modification
> > > >
> > > > Multiple tables can be modified in a statement when using triggers, foreign key
> > > > constraint, or modifying CTEs. When multiple tables are modified, we need
> > > > the state of tables before the modification.
> > > >
> > > > For example, when some tuples, dR and dS, are inserted into R and S respectively,
> > > > the tuples that will be inserted into the view are calculated by the following
> > > > two queries:
> > > >
> > > > "SELECT * FROM dR, S_pre"
> > > > "SELECT * FROM R, dS"
> > > >
> > > > where S_pre is the table before the modification, R is the current state of
> > > > table, that is, after the modification. This pre-update states of table
> > > > is calculated by filtering inserted tuples and appending deleted tuples.
> > > > The subquery that represents pre-update state is generated in get_prestate_rte().
> > > > Specifically, the insterted tuples are filtered by calling IVM_visible_in_prestate()
> > > > in WHERE clause. This function checks the visibility of tuples by using
> > > > the snapshot taken before table modification. The deleted tuples are contained
> > > > in the old transition table, and this table is appended using UNION ALL.
> > > >
> > > > Transition tables for each modification are collected in each AFTER trigger
> > > > function call. Then, the view maintenance is performed in the last call of
> > > > the trigger.
> > > >
> > > > In the original PostgreSQL, tuplestores of transition tables are freed at the
> > > > end of each nested query. However, their lifespan needs to be prolonged to
> > > > the end of the out-most query in order to maintain the view in the last AFTER
> > > > trigger. For this purpose, SetTransitionTablePreserved is added in trigger.c.
> > > >
> > > > ** Duplicate Tulpes
> > > >
> > > > When calculating changes that will occur in the view (= delta tables),
> > > > multiplicity of tuples are calculated by using count(*).
> > > >
> > > > When deleting tuples from the view, tuples to be deleted are identified by
> > > > joining the delta table with the view, and tuples are deleted as many as
> > > > specified multiplicity by numbered using row_number() function.
> > > > This is implemented in apply_old_delta().
> > > >
> > > > When inserting tuples into the view, each tuple is duplicated to the
> > > > specified multiplicity using generate_series() function. This is implemented
> > > > in apply_new_delta().
> > > >
> > > > ** DISTINCT clause
> > > >
> > > > When DISTINCT is used, the view has a hidden column __ivm_count__ that
> > > > stores multiplicity for tuples. When tuples are deleted from or inserted into
> > > > the view, the values of __ivm_count__ column is decreased or increased as many
> > > > as specified multiplicity. Eventually, when the values becomes zero, the
> > > > corresponding tuple is deleted from the view. This is implemented in
> > > > apply_old_delta_with_count() and apply_new_delta_with_count().
> > > >
> > > > ** Aggregates
> > > >
> > > > Built-in count sum, avg, min, and max are supported. Whether a given
> > > > aggregate function can be used or not is checked by using its OID in
> > > > check_aggregate_supports_ivm().
> > > >
> > > > When creating a materialized view containing aggregates, in addition
> > > > to __ivm_count__, more than one hidden columns for each aggregate are
> > > > added to the target list. For example, columns for storing sum(x),
> > > > count(x) are added if we have avg(x). When the view is maintained,
> > > > aggregated values are updated using these hidden columns, also hidden
> > > > columns are updated at the same time.
> > > >
> > > > The maintenance of aggregated view is performed in
> > > > apply_old_delta_with_count() and apply_new_delta_with_count(). The SET
> > > > clauses for updating columns are generated by append_set_clause_*().
> > > >
> > > > If the view has min(x) or max(x) and the minimum or maximal value is
> > > > deleted from a table, we need to update the value to the new min/max
> > > > recalculated from the tables rather than incremental computation. This
> > > > is performed in recalc_and_set_values().
> > > >
> > > > ---------------------------------------------------------------------------------------
> > > > * Details of the patch-set (v28)
> > > >
> > > > > The patch-set consists of the following eleven patches.
> > > >
> > > > In the previous version, the number of patches were nine.
> > > > In the latest patch-set, the patches are divided more finely
> > > > aiming to make the review easier.
> > > >
> > > > > - 0001: Add a syntax to create Incrementally Maintainable Materialized Views
> > > >
> > > > The prposed syntax to create an incrementally maintainable materialized
> > > > view (IMMV) is;
> > > >
> > > > CREATE INCREMENTAL MATERIALIZED VIEW AS SELECT .....;
> > > >
> > > > However, this syntax is tentative, so any suggestions are welcomed.
> > > >
> > > > > - 0002: Add relisivm column to pg_class system catalog
> > > >
> > > > We add a new field in pg_class to indicate a relation is IMMV.
> > > > Another alternative is to add a new catalog for managing materialized
> > > > views including IMMV, but I am not sure if we want this.
> > > >
> > > > > - 0003: Allow to prolong life span of transition tables until transaction end
> > > >
> > > > This patch fixes the trigger system to allow to prolong lifespan of
> > > > tuple stores for transition tables until the transaction end. We need
> > > > this because multiple transition tables have to be preserved until the
> > > > end of the out-most query when multiple tables are modified by nested
> > > > triggers. (as explained above in Design - Multiple Tables Modification)
> > > >
> > > > If we don't want to change the trigger system in such way, the alternative
> > > > is to copy the contents of transition tables to other tuplestores, although
> > > > it needs more time and memory.
> > > >
> > > > > - 0004: Add Incremental View Maintenance support to pg_dump
> > > >
> > > > This patch enables pg_dump to output IMMV using the new syntax.
> > > >
> > > > > - 0005: Add Incremental View Maintenance support to psql
> > > >
> > > > This patch implements tab-completion for the new syntax and adds
> > > > information of IMMV to \d meta-command results.
> > > >
> > > > > - 0006: Add Incremental View Maintenance support
> > > >
> > > > This patch implements the basic IVM feature.
> > > > DISTINCT and aggregate are not supported here.
> > > >
> > > > When an IMMV is created, the view query is checked, and if any
> > > > non-supported feature is used, it raises an error. If it is ok,
> > > > triggers are created on base tables and an unique index is
> > > > created on the view if possible.
> > > >
> > > > In BEFORE trigger, an entry is created for each IMMV and the number
> > > > of trigger firing is counted. Also, the snapshot just before the
> > > > table modification is stored.
> > > >
> > > > In AFTER triggers, each transition tables are preserved. The number
> > > > of trigger firing is counted also here, and when the firing number of
> > > > BEFORE and AFTER trigger reach the same, it is deemed the final AFTER
> > > > trigger call.
> > > >
> > > > In the final AFTER trigger, the IMMV is maintained. Rewritten view
> > > > query is executed to generate delta tables, and deltas are applied
> > > > to the view. If multiple tables are modified simultaneously, this
> > > > process is iterated for each modified table. Tables before processed
> > > > are represented in "pre-update-state", processed tables are
> > > > "post-update-state" in the rewritten query.
> > > >
> > > > > - 0007: Add DISTINCT support for IVM
> > > >
> > > > This patch adds DISTINCT clause support.
> > > >
> > > > When an IMMV including DISTINCT is created, a hidden column
> > > > "__ivm_count__" is added to the target list. This column has the
> > > > number of duplicity of the same tuples. The duplicity is calculated
> > > > by adding "count(*)" and GROUP BY to the view query.
> > > >
> > > > When an IMMV is maintained, the duplicity in __ivm_count__ is updated,
> > > > and a tuples whose duplicity becomes zero can be deleted from the view.
> > > > This logic is implemented by SQL in apply_old_delta_with_count and
> > > > apply_new_delta_with_count.
> > > >
> > > > Columns starting with "__ivm_" are deemed hidden columns that doesn't
> > > > appear when a view is accessed by "SELECT * FROM ....". This is
> > > > implemented by fixing parse_relation.c.
> > > >
> > > > > - 0008: Add aggregates support in IVM
> > > >
> > > > This patch provides codes for aggregates support, specifically
> > > > for builtin count, sum, and avg.
> > > >
> > > > When an IMMV containing an aggregate is created, it is checked if this
> > > > aggregate function is supported, and if it is ok, some hidden columns
> > > > are added to the target list.
> > > >
> > > > When the IMMV is maintained, the aggregated value is updated as well as
> > > > related hidden columns. The way of update depends the type of aggregate
> > > > functions, and SET clause string is generated for each aggregate.
> > > >
> > > > > - 0009: Add support for min/max aggregates for IVM
> > > >
> > > > This patch adds min/max aggregates support.
> > > >
> > > > This is separated from #0008 because min/max needs more complicated
> > > > work than count, sum, and avg.
> > > >
> > > > If the view has min(x) or max(x) and the minimum or maximal value is
> > > > deleted from a table, we need to update the value to the new min/max
> > > > recalculated from the tables rather than incremental computation.
> > > > This is performed in recalc_and_set_values().
> > > >
> > > > TIDs and keys of tuples that need re-calculation are returned as a
> > > > result of the query that deleted min/max values from the view using
> > > > RETURNING clause. The plan to recalculate and set the new min/max value
> > > > are stored and reused.
> > > >
> > > > > - 0010: regression tests
> > > >
> > > > This patch provides regression tests for IVM.
> > > >
> > > > > - 0011: documentation
> > > >
> > > > This patch provides documantation for IVM.
> > > >
> > > > ---------------------------------------------------------------------------------------
> > > > * Changes from the Previous Version (v27)
> > > >
> > > > - Allow TRUNCATE on base tables
> > > >
> > > > When a base table is truncated, the view content will be empty if the
> > > > view definition query does not contain an aggregate without a GROUP clause.
> > > > Therefore, such views can be truncated.
> > > >
> > > > Aggregate views without a GROUP clause always have one row. Therefore,
> > > > if a base table is truncated, the view will not be empty and will contain
> > > > a row with NULL value (or 0 for count()). So, in this case, we refresh the
> > > > view instead of truncating it.
> > > >
> > > > - Fix bugs reported by huyajun [1]
> > > >
> > > > [1] https://www.postgresql.org/message-id/tencent_FCAF11BCA5003FD16BDDFDDA5D6A19587809%40qq.com
> > > >
> > > > ---------------------------------------------------------------------------------------
> > > > * Discussion
> > > >
> > > > ** Aggregate support
> > > >
> > > > There were a few suggestions that general aggregate functions should be
> > > > supported [2][3], which may be possible by extending pg_aggregate catalog.
> > > > However, we decided to leave supporting general aggregates to the future work [4]
> > > > because it would need substantial works and make the patch more complex and
> > > > bigger.
> > > >
> > > > There has been no opposite opinion on this. However, if we need more discussion
> > > > on the design of aggregate support, we can omit aggregate support for the first
> > > > release of IVM.
> > > >
> > > > [2] https://www.postgresql.org/message-id/20191128140333.GA25947%40alvherre.pgsql
> > > > [3] https://www.postgresql.org/message-id/CAM-w4HOvDrL4ou6m%3D592zUiKGVzTcOpNj-d_cJqzL00fdsS5kg%40mail.gmail.com
> > > > [4] https://www.postgresql.org/message-id/20201016193034.9a4c44c79fc1eca7babe093e%40sraoss.co.jp
> > > >
> > > > ** Hidden columns
> > > >
> > > > In order to support DISTINCT or aggregates, our implementation uses hidden columns.
> > > >
> > > > Columns starting with "__ivm_" are hidden columns that doesn't appear when a
> > > > view is accessed by "SELECT * FROM ....". For this aim, parse_relation.c is
> > > > fixed. There was a proposal to enable hidden columns by adding a new flag to
> > > > pg_attribute [5], but this thread is no longer active, so we decided to check
> > > > the hidden column by its name [6].
> > > >
> > > > [5] https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com
> > > > [6] https://www.postgresql.org/message-id/20201016193034.9a4c44c79fc1eca7babe093e%40sraoss.co.jp
> > > >
> > > > ** Concurrent Transactions
> > > >
> > > > When the view definition has more than one table, we acquire an exclusive
> > > > lock before the view maintenance in order to avoid inconsistent results.
> > > > This behavior was explained in [7]. The lock was improved to use weaker lock
> > > > when the view has only one table based on a suggestion from Konstantin Knizhnik [8].
> > > > However, due to the implementation that uses ctid for identifying target tuples,
> > > > we still have to use an exclusive lock for DELETE and UPDATE.
> > > >
> > > > [7] https://www.postgresql.org/message-id/20200909092752.c91758a1bec3479668e82643%40sraoss.co.jp
> > > > [8] https://www.postgresql.org/message-id/5663f5f0-48af-686c-bf3c-62d279567e2a%40postgrespro.ru
> > > >
> > > > ** Automatic Index Creation
> > > >
> > > > When a view is created, a unique index is automatically created if
> > > > possible, that is, if the view definition query has a GROUP BY or
> > > > DISTINCT, or if the view contains all primary key attributes of
> > > > its base tables in the target list. It is necessary for efficient
> > > > view maintenance. This feature is based on a suggestion from
> > > > Konstantin Knizhnik [9].
> > > >
> > > > [9] https://www.postgresql.org/message-id/89729da8-9042-7ea0-95af-e415df6da14d%40postgrespro.ru
> > > >
> > > >
> > > > ** Trigger and Transition Tables
> > > >
> > > > We implemented IVM based on triggers. This is because we want to use
> > > > transition tables to extract changes on base tables. Also, there are
> > > > other constraint that are using triggers in its implementation, like
> > > > foreign references. However, if we can use transition table like feature
> > > > without relying triggers, we don't have to insist to use triggers and we
> > > > might implement IVM in the executor directly as similar as declarative
> > > > partitioning.
> > > >
> > > > ** Feature to be Supported in the First Release
> > > >
> > > > The current patch-set supports DISTINCT and aggregates for built-in count,
> > > > sum, avg, min and max. Do we need all these feature for the first IVM release?
> > > > Supporting DISTINCT and aggregates needs discussion on hidden columns, and
> > > > for supporting min/max we need to discuss on re-calculation method. Before
> > > > handling such relatively advanced feature, maybe, should we focus to design
> > > > and implement of the basic feature of IVM?
> > > >
> > > >
> > > > Any suggestion and discussion are welcomed!
> > > >
> > > > Regards,
> > > > Yugo Nagata
> > > >
> > > > --
> > > > Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
> > > >
> > > >
> > >
> > >
> > > > The followings are supported in view definition queries:
> > > > - SELECT ... FROM ... WHERE ..., joins (inner joins, self-joins)
> > >
> > >
> > > > Also, a view definition query cannot contain other views, materialized views,
> > > > foreign tables, partitioned tables, partitions, VALUES, non-immutable functions,
> > > > system columns, or expressions that contains aggregates.
> > >
> > > Does this also apply to tableoid? but tableoid is a constant, so it
> > > should be fine?
> > > can following two queries apply to this feature.
> > > select tableoid, unique1 from tenk1;
> >
> > Currently, this is not allowed because tableoid is a system column.
> > As you say, tableoid is a constant, so we can allow. Should we do this?
> >
> > > select 1 as constant, unique1 from tenk1;
> >
> > This is allowed, of course.
> >
> > > I didn't apply the patch.(will do later, for someone to test, it would
> > > be a better idea to dump a whole file separately....).
> >
> > Thank you! I'm looking forward to your feedback.
> > (I didn't attach a whole patch separately because I wouldn't like
> > cfbot to be unhappy...)
> >
> > Regards,
> > Yugo Nagata
> >
> > --
> > Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
>
> I played around first half of regress patch.

I'm so sorry for the late reply.

> these all following queries fails.
>
> CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_rename AS
> SELECT DISTINCT * , 1 as "__ivm_count__" FROM mv_base_a;
>
> CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_rename AS
> SELECT DISTINCT * , 1 as "__ivm_countblablabla" FROM mv_base_a;
>
> CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_rename AS
> SELECT DISTINCT * , 1 as "__ivm_count" FROM mv_base_a;
>
> CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_rename AS
> SELECT DISTINCT * , 1 as "__ivm_count_____" FROM mv_base_a;
>
> CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_rename AS
> SELECT DISTINCT * , 1 as "__ivm_countblabla" FROM mv_base_a;
>
> so the hidden column reserved pattern "__ivm_count.*"? that would be a lot....

Column names which start with "__ivm_" are prohibited because hidden columns
using this pattern are used for handling views with aggregate or DISTINCT.
Even when neither aggregate or DISINCT is used, such column name is used
for handling tuple duplicates in views. So, if we choose not to allow
tuple duplicates in the initial version of IVM, we would remove this
restriction for now....

>
> select * from pg_matviews where matviewname = 'mv_ivm_1';
> don't have relisivm option. it's reasonable to make it in view pg_matviews?

Make sense. I'll do it.

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Conway 2023-08-27 13:41:01 Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG
Previous Message Masahiko Sawada 2023-08-27 12:53:00 Re: [PoC] Improve dead tuple storage for lazy vacuum