Versioned, chunked documents

From: Ivan Voras <ivoras(at)freebsd(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Versioned, chunked documents
Date: 2012-04-01 22:38:08
Message-ID: jlalch$quh$1@dough.gmane.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I have documents which are divided into chunks, so that the (ordered)
concatenation of chunks make the whole document. Each of the chunks may
be edited separately and past versions of the chunks need to be kept.

The structure looks fairly simple:

CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
...
);

CREATE TABLE documents_chunks (
id SERIAL PRIMARY KEY,
ctime TIMESTAMP NOT NULL,
documents_id INTEGER REFERENCES documents(id),
seq INTEGER NOT NULL, -- sequence within the document
content TEXT,
...
);

The first goal is to retrieve the latest version of the whole document,
made from the latest versions of all chunks, but later the goal will
also be to fetch the whole version at some point in time (i.e. with
chunks created before a point in time).

I did the first goal by creating two helper views:

CREATE VIEW documents_chunks_last_version_chunk_ids AS
SELECT documents_id, max(id), seq FROM documents_chunks GROUP BY
documents_id, seq;

CREATE VIEW documents_chunks_last_version_content AS
SELECT documents_chunks.documents_id, content
FROM documents_chunks
JOIN documents_chunks_last_version_chunk_ids ON
documents_chunks.id=documents_chunks_last_version_chunk_ids.max
ORDER BY documents_chunks_last_version_chunk_ids.seq;

There are indexes on the document_chunks fields seq and documents_id.

Everything looked fine until I examined the output of EXPLAIN ANALYZE
and saw this:

db=> set enable_seqscan to off;
SET
db=> explain analyze select * from documents_chunks_last_version_content
where documents_id=1;

EXPLAIN output given in: http://explain.depesz.com/s/mpY

The query output seems correct on this test case:
db=> select * from documents_chunks_last_version_content where
documents_id=1;
documents_id | content
--------------+---------
1 | C1, v2
1 | C2, v3
1 | C3, v1
(3 rows)

This huge cost of 10000000000 which appeared out of nowhere in the
EXPLAIN output and the seq scan worry me - where did that come from?
There are absolutely no unindexed fields in the query, and the result
set of the aggregate ("max") is very small.

Of course, I might be doing the whole structure wrong - any ideas?

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ondrej Ivanič 2012-04-01 22:58:43 Re: Versioned, chunked documents
Previous Message Jerry Sievers 2012-04-01 21:40:00 Re: Parameter setting in multi-statement command; I got bit today