Re: GSoC - proposal - Materialized Views in PostgreSQL

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: pavelbaros <baros(dot)p(at)seznam(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: GSoC - proposal - Materialized Views in PostgreSQL
Date: 2010-04-10 03:53:29
Message-ID: 4BBFF639.6010305@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

pavelbaros wrote:
> I am also waiting for approval for my repository named
> "materialized_view" on git.postgresql.org, so I could publish
> completed parts.

Presuming that you're going to wander there and get assigned what looks
like an official repo name for this project is a bit...optimistic. I
would recommend that you publish to something like github instead (you
can fork http://github.com/postgres/postgres ), and if the work looks
good enough that it gets picked up by the community maybe you migrate it
onto the main site eventually. git.postgresql.org is really not setup
to be general hosting space for everyone who has a PostgreSQL related
project; almost every repo on there belongs to someone who has already
been a steady project contributor for a number of years.

(Switching to boilerplate mode for a paragraph...) You have picked a
PostgreSQL feature that is dramatically more difficult than it appears
to be, and I wouldn't expect you'll actually finish even a fraction of
your goals in a summer of work. You're at least in plentiful
company--most students do the same. As a rule, if you see a feature on
our TODO list that looks really useful and fun to work on, it's only
still there because people have tried multiple times to build it
completely but not managed to do so because it's harder than it
appears. This is certainly the case with materialized views.

You've outlined a reasonable way to build a prototype that does a
limited implementation here. The issue is what it will take to extend
that into being production quality for the real-world uses of
materialized views. How useful your prototype is depends on how well it
implements a subset of that in a way that will get used by the final design.

The main hidden complexity in this particular project relates to
handling view refreshes. The non-obvious problem is that when the view
updates, you need something like a SQL MERGE to really handle that in a
robust way that doesn't conflict with concurrent access to queries
against the materialized view. And work on MERGE support is itself
blocked behind the fact that PostgreSQL doesn't have a good way to lock
access to a key value that doesn't exist yet--what other databases call
key range locking. See the notes for "Add SQL-standard
MERGE/REPLACE/UPSERT command" at http://wiki.postgresql.org/wiki/Todo
for more information.

You can work around that to build a prototype by grabbing a full table
lock on the materialized view when updating it, but that's not a
production quality solution. Solving that little detail is actually
more work than the entire project you've outlined. Your suggested
implementation--"In function CloseIntoRel executor swap relfilenode's of
temp table and original table and finally delete temp table"--is where
the full table lock is going to end up at. The exact use cases that
need materialized views cannot handle a CLUSTER-style table recreation
each time that needs an exclusive lock to switchover, so that whole part
of your design is going to be a prototype that doesn't work at all like
what needs to get built to make this feature committable. It's also not
a reasonable assumption that you have enough disk space to hold a second
copy of the MV in a production system.

Once there's a good way to merge updates, how to efficiently generate
them against the sort of large data sets that need materalized views--so
you just write out the updates rather than a whole new copy--is itself a
large project with a significant quantity of academic research to absorb
before starting. Dan Colish at Portland State has been playing around
with prototypes for the specific problem of finding a good algorithm for
view refreshing that is compatible with PostgreSQL's execution model.
He's already recognized the table lock issue here and for the moment is
ignoring that part. I don't have a good feel yet for how long the
targeted update code will take to mature, but based on what I do know I
suspect that little detail is also a larger effort than the entire scope
you're envisioning. There's a reason why the MIT Press compendium
"Materialized Views: Techniques, Implementations, and Applications" is
over 600 pages long--I hope you've already started digging through that
material.

Now, with all that said, that doesn't mean there's not a useful project
for you buried in this mess. The first two steps in your plan:

1) create materialized view
2) change rewriter

Include building a prototype grammer, doing an initial executor
implementation, and getting some sort of rewriter working. That is
potentially good groundwork to lay here. I would suggest that you
completely drop your step 3:

3) create command that takes snapshot (refresh MV)

Because you cannot built that in a way that will be useful (and by that
I mean committable quality) until there's a better way to handle updates
than writing a whole new table and grabbing a full relation lock to
switch to it. To do a good job just on the first two steps should take
at least a whole summer anyway--there's a whole stack of background
research needed I haven't seen anyone do yet, and that isn't on your
plan yet. There is a precedent for taking this approach. After getting
stalled trying to add the entirety of easy partitioning to PostgreSQL,
the current scope has been scaled back to just trying to get the syntax
and on-disk structure right, then finish off the implementation. See
http://wiki.postgresql.org/wiki/Table_partitioning to get an idea how
that's been broken into those two major chunks.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2010-04-10 04:32:38 Re: GSoC - proposal - Materialized Views in PostgreSQL
Previous Message Joseph Adams 2010-04-10 00:28:38 Re: GSOC PostgreSQL partitioning issue