Use arrays or not?

From: Roelant Ossewaarde <miep(at)belboek(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Use arrays or not?
Date: 2004-04-29 16:41:06
Message-ID: 20040429164059.GC24558@belboek.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance pgsql-sql

Hi,

I am building an application using postgresql to store XML-records. There is a
debate within the group of developers about the best way to store our data. I
hope you can help us make a decision.

The data consists of XML-records, with a lot of XML-fields. I want to store
the XML as it is, so taking the information from the XML-records and then
storing it in a different-from-XML-format is not an option.

Each XML-record describes data about one book. If an update of bookdata comes,
the XML itself is not changed, but a new XML-record is stored with the updated
data. Via a complex scheme of combining a base record and its updates, the
final dataset is produced that is used in the application.

There are different XML-formats that need to be combined. Right now, we can
handle three different XML-formats, each with its own structure (but all
describing book-data).

Searching is done via a simple table lookup on three different fields: title,
author and subject. The data for these fields is extracted from the
database. Each book has a unique identifier (EAN13, derivative of ISBN).

Here is one way to organize the database:
table title:
TITLE | EAN13, indexing on TITLE

table author:
AUTHOR | EAN13, indexing on AUTHOR

table subject:
SUBJECT | EAN13, indexing on SUBJECT.

Finally:
table record:
EAN13 | ARRAY OF XML-records.

It's the last table that I am most curious (and worried) about, the question
being mainly what the optimal way of structuring that table is. Option 1 is
the given option: adding/deleting an XML-record for the same book requires
adding/deleting it to/from the array of XML-records.

Option 2 would be something like this:
EAN13 | XML-record
where, if a book has several records describing it, there are multiple entries
of the EAN13|XML-record - pair. Adding an XML-record for the same book,
requires adding a new entry to the table as a whole.

So, option 1-tables look like this:
EAN13 | ARRAY OF XML-records
0001 | {<XML1>...</XML1>, <XML2>...</XML2>, ...}
0002 | {<XML1>...</XML1>, <XML2>...</XML2>, ...}

Option-2 tables look like this:
EAN13 | ARRAY OF XML-records
0001 | <XML1>...</XML1>
0001 | <XML2>...</XML2>
0002 | <XML1>...</XML1>
0002 | <XML2>...</XML2>

We can't decide which one is best. These are some issues we can think of:

Indexing: For option 1, the EAN13-index remains unique, even if you have
multiple XML-records; for option 2 it does not, since multiple XML-records are
stored as multiple tuples. On the other hand, an additional internal index can
be used to link the several tuples of option 2 to the information in the
`lookup'-tables (author, title, keyword). Does any of these two options
increase query efficiency, ie. speed?

Database growth: On average, the information about a book is updated three
times per year. In option 1, this means that the length of the table does not
increase, but the width does. If we choose option 2, if we have three updates
per book each year, the length of the table triples, but the width does
not. What is more costly to store for postgres, long arrays or long tables?

Integrity: Option 1 means that our software needs to keep track of all the
bookkeeping for arrays, since such support is quite rudimentary in
postgres. For example, it is hard to take out a record from the middle of an
array. Also, a multidimensional array, which contains for each record the
record itself and its type, is even harder to maintain. Option 2 has a simpler
datatype, so integrity can be easier inforced using the standard
postgres-machinery of variable-types etc.

Arrays are non-standard SQL, and I hear that PHP-support for postgres & arrays
is rudimentary. So that might be an argument to avoid using them, and go for
option 2. From the standpoint of performance (or wisdom), can you help me
decide what I should choose? Or is there maybe an even better way to structure
my data?

Thanks for any contribution!

Roelant.

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Manfred Koizar 2004-04-29 16:56:36 Re: [HACKERS] Number of pages in a random sample
Previous Message ohp 2004-04-29 13:20:18 Re: Wierd context-switching issue on Xeon

Browse pgsql-sql by date

  From Date Subject
Next Message Josh Berkus 2004-04-29 17:23:58 Re: Use arrays or not?
Previous Message Rob 2004-04-29 14:18:36 Re: Equivalant of SQL Server's Nchar and NVARCHAR