Re: large xml database

From: Viktor Bojović <viktor(dot)bojovic(at)gmail(dot)com>
To: Rob Sargent <robjsargent(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: large xml database
Date: 2010-10-31 22:52:42
Message-ID: AANLkTi=iQkAA3Bc0ua-ZauXwfFJg+eLqvEb4rb+NFaj2@mail.gmail.com
Lists: pgsql-sql

On Sun, Oct 31, 2010 at 10:26 PM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:

> Viktor Bojović wrote:
>
>>
>>
>> On Sun, Oct 31, 2010 at 9:42 PM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:
>>
>>
>>
>>
>> Viktor Bojović wrote:
>>
>>
>>
>> On Sun, Oct 31, 2010 at 2:26 AM, James Cloos
>> <cloos(at)jhcloos(dot)com> wrote:
>>
>> >>>>> "VB" == Viktor Bojović <viktor(dot)bojovic(at)gmail(dot)com> writes:
>>
>> VB> i have very big XML documment which is larger than 50GB and
>> want to
>> VB> import it into databse, and transform it to relational
>> schema.
>>
>> Were I doing such a conversion, I'd use perl to convert the
>> xml into
>> something which COPY can grok. Any other language, script
>> or compiled,
>> would work just as well. The goal is to avoid having to
>> slurp the
>> whole
>> xml structure into memory.
>>
>> -JimC
>> --
>> James Cloos <cloos(at)jhcloos(dot)com>
>>
>>
>> OpenPGP: 1024D/ED7DAEA6
>>
>>
>> The insertion into the database is not a big problem.
>> I insert it as XML docs, as varchar lines, or as XML docs in
>> varchar format. Usually I use a transaction and commit after each
>> block of 1000 inserts, and it goes very fast, so insertion is
>> done after a few hours.
>> But the problem occurs when I want to transform it inside the
>> database from XML (varchar or XML format) into tables by parsing.
>> That processing takes too much time in the database no matter whether
>> it is stored as varchar lines, varchar nodes, or the XML data type.
>>
>> -- ---------------------------------------
>> Viktor Bojović
>>
>> ---------------------------------------
>> Wherever I go, Murphy goes with me
>>
>>
>> Are you saying you first load the xml into the database, then
>> parse that xml into instance of objects (rows in tables)?
>>
>>
>> Yes. That way takes less RAM than using Twig or SimpleXML, so I tried
>> using the PostgreSQL XML functions or regexes.
>>
>>
>>
>> --
>> ---------------------------------------
>> Viktor Bojović
>> ---------------------------------------
>> Wherever I go, Murphy goes with me
>>
> Is the entire load a set of "entry" elements as your example contains?
> This I believe would parse nicely into a tidy but non-trivial schema
> directly without the "middle-man" of having xml in db (unless of course you
> prefer xpath to sql ;) )
>
> The single most significant caveat I would have for you is Beware:
> Biologists involved. Inconsistency (at least overloaded concepts) almost
> assured :). EMBL too is suspect imho, but I've been out of that arena for a
> while.
>
>
Unfortunately some elements are always missing, so I had to create a script
that scanned the whole SwissProt and TrEMBL document and stored the result in a
file, to use as a template for building a code generator once I find the best
parser for this purpose. To parse all the elements in one day, I would need a
parser capable of handling at least 128 entry blocks per second @
2.4GHz. You are right about inconsistency; I constantly have problems with
PDB files.
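For reference, the streaming conversion JimC suggests (read the XML incrementally, write something COPY can ingest) could be sketched roughly like this in Python; the element names (`entry`, `name`, `sequence`) are illustrative only, not the real UniProt schema:

```python
# Stream a huge XML file with iterparse so memory stays flat, and emit
# one tab-separated line per <entry> for PostgreSQL's COPY.
# NOTE: <entry>, <name>, <sequence> are hypothetical tag names.
import xml.etree.ElementTree as ET

def xml_to_copy(xml_path, out_path):
    with open(out_path, "w") as out:
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "entry":
                name = elem.findtext("name", default="\\N")  # \N = COPY NULL
                seq = elem.findtext("sequence", default="\\N")
                out.write(f"{name}\t{seq}\n")
                elem.clear()  # free the processed subtree

# Throughput check for the "128 entry blocks per second" figure:
# 128 * 86400 seconds = 11,059,200 entries processed per day.
```

The resulting file could then be loaded in one pass with something like `COPY entries FROM '/path/to/out.tsv'`, keeping the slow per-row XML parsing out of the database entirely.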

btw.
you mentioned "This I believe would parse nicely into a tidy but
non-trivial schema directly" - does that mean PostgreSQL has support for
restoring a database schema from XML files?

--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

Attachment Content-Type Size
sp.nodes.txt text/plain 8.0 KB
tr.nodes.txt text/plain 4.9 KB
