Re: Versioned, chunked documents

From: Martin Gregorie <martin(at)gregorie(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Versioned, chunked documents
Date: 2012-04-01 23:52:35
Message-ID: 1333324355.1754.157.camel@zappa.gregorie.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Correction interpolated - see below

On Mon, 2012-04-02 at 00:22 +0100, Martin Gregorie wrote:
> On Mon, 2012-04-02 at 00:38 +0200, Ivan Voras wrote:
> > Hi,
> >
> > I have documents which are divided into chunks, so that the (ordered)
> > concatenation of chunks make the whole document. Each of the chunks may
> > be edited separately and past versions of the chunks need to be kept.
> >
> > The structure looks fairly simple:
> >
> > The first goal is to retrieve the latest version of the whole document,
> > made from the latest versions of all chunks, but later the goal will
> > also be to fetch the whole version at some point in time (i.e. with
> > chunks created before a point in time).
> >
> > I did the first goal by creating two helper views:
> >
> > CREATE VIEW documents_chunks_last_version_chunk_ids AS
> > SELECT documents_id, max(id), seq FROM documents_chunks GROUP BY
> > documents_id, seq;
> >
> > CREATE VIEW documents_chunks_last_version_content AS
> > SELECT documents_chunks.documents_id, content
> > FROM documents_chunks
> > JOIN documents_chunks_last_version_chunk_ids ON
> > documents_chunks.id=documents_chunks_last_version_chunk_ids.max
> > ORDER BY documents_chunks_last_version_chunk_ids.seq;
> >
> > There are indexes on the document_chunks fields seq and documents_id.
> >
> > Everything looked fine until I examined the output of EXPLAIN ANALYZE
> > and saw this:
> >
> I'm not surprised. First guess: I'd use
> id (FK of documents), seq, ctime
>
> as the prime key of document_chunk, which would work for your initial
> requirement but is far too simplistic to deal with the general
> requirement of retrieving a specific document version. You'd probably
> need something like:
>
> CREATE TABLE documents (
> id SERIAL PRIMARY KEY,
> title TEXT NOT NULL,
> ...
> );
>
> Create table version (
> version_no serial primary key,
> author var char 40,
> ctime timestamp );
>
> create table document_version (
> id serial references document(id),
> version_number serial references version(version_no),
> primary_key (id, version_no),
> );
I used a version number in the key because a TIMESTAMP is much too fine
grained unless you're planning to obtain its value before committing all
the changed document_chunks affected by this editing session.

BTW, why use document_chunks when a text field can hold megabytes,
especially if they will be concatenated to form a complete document
which is then edited as a whole item and before being split into chunks
and saved back to the database? If the chunks represent chapters or
other logical sections that are always edited separately, why not name
them to reflect this? 'document_chunk' just sounds too arbitrary to me.

>
> CREATE TABLE documents_chunks (
> id SERIAL references document_version(id),
> version_number serial references document_version(version_number),
> seq serial,
> content TEXT,
> primary_key(id, version_number, seq)
==> should be
primary_key(id, seq, version_number) # of course!
> };
>

> Disclaimer: this is not syntax checked or tested. It may/or may not
> match your requirements, but since I haven't seen your ERD or the 3NF
> you derived from it I can't offer any more applicable advice.

Martin

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jay Levitt 2012-04-02 04:29:11 Re: Switching to Homebrew as recommended Mac install?
Previous Message Martin Gregorie 2012-04-01 23:22:02 Re: Versioned, chunked documents