Re: Implementing Incremental View Maintenance

From: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
To: Mitar <mmitar(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: Implementing Incremental View Maintenance
Date: 2019-01-31 14:20:32
Message-ID: 20190131232032.feaab8f20d29cdc75898c896@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 7 Jan 2019 00:39:00 -0800
Mitar <mmitar(at)gmail(dot)com> wrote:

> That sounds great! I am interested in this topic because I am
> interested in reactive/live queries and support for them in
> PostgreSQL. [1]
>
> In that context, the problem is very similar: based on some state of
> query results and updated source tables, determine what should be new
> updates to send to the client describing changes to the query results.
> So after computing those incremental changes, instead of applying them
> to materialized view I would send them to the client. One could see
> materialized views only type of consumers of such information about
> incremental change.
>
> So I would like to ask if whatever is done in this setting is done in
> a way that one could also outside of the context of materialized view.
> Not sure what would API be thought.

I didn't know about reactive/live queries but this seems share a part of
problem with IVM, so we might have common API.

BTW, what is uecase of reactive/live queries? (just curious)

> > For these reasons, we started to think to implement IVM without relying on OIDs
> > and made a bit more surveys.
>
> I also do not see much difference between asking users to have primary
> key on base tables or asking them to have OIDs. Why do you think that
> a requirement for primary keys is a hard one? I think we should first
> focus on having IVM with base tables with primary keys. Maybe then
> later on we could improve on that and make it also work without.
>
> To me personally, having unique index on source tables and also on
> materialized view is a reasonable restriction for this feature.
> Especially for initial versions of it.

Initially, I chose to use OIDs for theoretical reason, that is, to handle
"bag-semantics" which allows duplicate rows in tables. However, I agree
that we start from the restriction of having unique index on base tables.

> > If we can represent a change of UPDATE on a base table as query-like rather than
> > OLD and NEW, it may be possible to update the materialized view directly instead
> > of performing delete & insert.
>
> Why do you need OLD and NEW? Don't you need just NEW and a list of
> columns which changed from those in NEW? I use such diffing query [4]
> to represent changes: first column has a flag telling if the row is
> representing insert, update, and remove, the second column tells which
> column are being changed in the case of the update, and then the NEW
> columns follow.

According the change propagation equation approach, OLD is necessary
to calculate tuples in MV to be deleted or modified. However, if tables
has unique keys, such tuples can be identifeid using the keys, so
OLD may not be needed, at least in eager approach.

In lazy approach, OLD contents of table is useful. For example, with
a join view MV = R * S, when dR is inserted into R and dS is inserted
into S, the delta to be inserted into MV will be

dMV = (R_old * dS) + (dR * S_new)
= (R_old * dS) + (dR * S_old) + (dR * dS)

, hence the old contents of tables R and S are needed.

> I think that maybe standardizing structure for representing those
> changes would be a good step towards making this modular and reusable.
> Because then we can have three parts:
>
> * Recording and storing changes in a standard format.
> * A function which given original data, stored changes, computes
> updates needed, also in some standard format.
> * A function which given original data and updates needed, applies them.

> I think if we split things into three parts as I described above, then
> this is just a question of configuration. Or you call all three inside
> one trigger to update in "eager" fashion. Or you store computed
> updates somewhere and then on demand apply those in "lazy" fashion.

I agree that defining the format to represent changes is important. However,
I am not sure both of eager and lazy can be handled in the same manner. I'll
consider about this more.

> > In the eager maintenance approache, we have to consider a race condition where two
> > different transactions change base tables simultaneously as discussed in [4].
>
> But in the case of "lazy" maintenance there is a mirror problem: what
> if later changes to base tables invalidate some previous change to the
> materialized view. Imagine that one cell in a base table is first
> updated too "foo" and we compute an update for the materialized view
> to set it to "foo". And then the same cell is updated to "bar" and we
> compute an update for the materialized view again. If we have not
> applied any of those updates (because we are "lazy") now the
> previously computed update can be discarded. We could still apply
> both, but it would not be efficient.

In our PoC implementation, I handled this situation by removing
old contents from NEW delata table. In your example, when the base
table is updated from "foo" to "bar", the "foo" tuple is removed
from and the "bar" tuple is inserted in NEW delta and the delta
of MV is computed using the final NEW delta.

Regards,
--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Gierth 2019-01-31 14:33:31 Re: interval type additional option
Previous Message Andres Freund 2019-01-31 13:56:08 Re: Add client connection check during the execution of the query