Re: Alternative MATERIALIZED VIEW design and implementation with history table and other features

From: Nico Williams <nico(at)cryptonector(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: Alternative MATERIALIZED VIEW design and implementation with history table and other features
Date: 2016-11-28 16:44:48
Message-ID: 20161128164447.GA11117@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Nov 26, 2016 at 04:13:19PM -0600, Jim Nasby wrote:
> On 11/22/16 9:11 PM, Nico Williams wrote:
> >But we needed a method for recording deltas from REFRESHes, and that's
> >not supported. So I coded up my own version of materialized views, in
> >PlPgSQL, that does provide a history feature.
>
> Getting history tracking included in core is going to take a LOT of effort;
> not from a code standpoint, but to get everyone to agree on the use cases,
> interface, etc. In short: I wouldn't go there.

Thanks for the reply.

Well, I want to go there :) And, frankly, I think the interface details
aren't that hard, though I could be wrong.

The bare minimum[0] columns for the history table are: a transaction ID,
a column of the view's record type[1] for "old"/deleted rows, and a
column of the view's record type for "new"/inserted rows.

If a row was deleted, then you get a row in the history table with that
row's record value as the "old" column's value, with the "new" column's
value being NULL. Similarly for insertions. For updates, if there's a
primary key[2] we could have a single history row with non-NULL values
for all three columns, else two history rows, one with the old row as
the "old" column's value and NULL for the "new" column, and another
history row with a NULL for the "old" column and the new row as the
value of the "new" column.

The trickiest thing here is the "transaction ID". In my use case I have
to deal with an ORM's notion of transaction ID, so no PG notion of TX ID
helps me :(

> What *would* be useful is work on generating delta information from a set of
> changes to a table: see below.

My implementation does that too (via triggers).

> >Besides a history feature, this includes the ability to record changes
> >made to a materialized view's materialization table, which means I can
> >have triggers that update the materialized view.
>
> Just looking at the commend block in the file, it's not clear what you mean
> by this. Does this mean if you ALTER the "materialization table" (which I
> assume is the materialized output?), does the view somehow get modified?

It means that if you INSERT/UPDATE/DELETE on a VIEW materialized with my
implementation, then those changes are recorded in the history table for
the view. However, it is important that such changes be consistent with
the VIEW, else the next refresh will undo them (naturally).

> >Of particular interest may be the fact that the FULL OUTER JOIN that PG
> >does for REFRESH CONCURRENTLY, and which I copied here, doesn't deal
> >well with views that have NULLs in any columns used in the join. It
> >would be nice to have an equijoin that uses IS NOT DISTINCT FROM rather
> >than just =, and then refreshing could use such a join in order to deal
> >properly with NULLs.
>
> Well, you could potentially just cast the composite types to text and join
> on that, but...

Only if there's a non-ambiguous text representation of row values.
Also, this might make refreshes slow.

> Kevin Grittner (author of the original matviews patch) has mentioned
> detecting changes to underlying relations previously[1]. Getting even a
> simple form of that working is going to be critical for any kind of
> incremental matview updating. It sounds like you're doing something
> different from incremental update, but the core problem is still the same:
> how to efficiently identify changes to underlying matview data and apply
> those changes to the view.

Incremental update is *exactly* what I'm doing. It's also exactly what
REFRESH CONCURRENTLY does today, except that REFRESH CONCURRENTLY
doesn't save the history but my implementation does.

> I suggest taking a look at what Kevin's written about that, as well as the

Sure.

> paper he mentioned. Some of this does assume that you have the equivalent to
> NEW and OLD, but IIRC those are not actually mandatory (ie: you could use
> the current matview contents as OLD and the dynamic view as NEW). Even a

My implementation is rather trivial. You might want to look at it. Its
guts are actually copied from what PG already does for REFRESH
CONCURRENTLY.

> pure SQL/plpgsql implementation of what's in the paper that Kevin mentioned
> would probably be valuable to ongoing work, as well as being applicable to
> what you've done.
>
> 1: https://www.postgresql.org/message-id/1371480075.55528.YahooMailNeo@web162901.mail.bf1.yahoo.com

Thanks for the link.

My implementation even supports auto-refresh of stale VIEWs and uses
NOTIFY/LISTEN channels as the FIFO. There is a PGSQL-coded conditional
refresh function (refresh_if_needed()) that a daemon calls periodically
and whenever it gets a NOTIFY on a channel that it LISTENs on; the
"sla", last refresh timestamp, and "refresh needed" indicators for each
materialized view, are recorded in a table.

That's what the pqasyncnotifier.c file is about: it makes it possible[3]
to shell-code a daemon that refreshes stale materialized views
automatically. The daemon for that is not included in our contribution,
but it is trivial, though ideally it should be coded in C in the
postgres server and run as a child process.

[0] In my implementation I also added a timestamp, but this -and related
information about transactions- should be normalized away.

[1] That's right, I use record types so as to make sure that all history
tables have the exact same number of columns.

I must say it's annoying that I can't declare a TABLE as having the
same record type as another TABLE, and that I can't CAST from one
record type to another with the exact same number of columns and the
same types for them.

[2] VIEWs don't get PRIMARY KEYs, naturally, but a MATERIALIZED VIEW is
a VIEW plus a hidden TABLE to hold the materialization. A PK makes
a lot of sense for a MATERIALIZED VIEW, which is why the syntax I
gave for an ideal CREATE MATERIALIZED VIEW includes an optional PK
declaration.

[3] It's not possible to shell-code this with psql(1) because psql(1)
doesn't write to stdout when a notification is received on a
channel; it writes to stdout about notifications only when input on
stdin is _also_ received.

Nico
--

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-11-28 16:52:07 Re: A bug of psql completion
Previous Message David Fetter 2016-11-28 16:26:39 Re: Tackling JsonPath support