Re: large xml database

From: Andy Colson <andy(at)squeakycode(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: large xml database
Date: 2010-10-31 13:51:52
Message-ID: 4CCD7478.40208@squeakycode.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10/30/2010 4:48 PM, Viktor Bojović wrote:
> Hi,
> i have very big XML documment which is larger than 50GB and want to
> import it into databse, and transform it to relational schema.
> When splitting this documment to smaller independent xml documments i
> get ~11.1mil XML documents.
> I have spent lots of time trying to get fastest way to transform all
> this data but every time i give up because it takes too much time.
> Sometimes more than month it would take if not stopped.
> I have tried to insert each line as varchar into database and parse it
> using plperl regex..
> also i have tried to store every documment as XML and parse it, but it
> is also to slow.
> i have tried to store every documment as varchar but it is also slow
> when using regex to get data.
>
> many tries have failed because 8GB of ram and 10gb of swap were not
> enough. also sometimes i get that more than 2^32 operations were
> performed, and functions stopped to work.
>
> i wanted just to ask if someone knows how to speed this up.
>
> thanx in advance
> --
> ---------------------------------------
> Viktor Bojović
> ---------------------------------------
> Wherever I go, Murphy goes with me

It might help a little to know a few more detail. Like what is in the
xml doc. When you say convert to relational, do you mean multiple
tables (and no more xml tags), or do you mean a table with a blob column
that contains some xml fragment?

I have imported millions of rows and never run out of memory. The
database will take care of itself unless you are doing something really
bad. I'd guess its the xml parser running out of ram and not the
database. Are you using dom or sax?

You say it took too much time. What did? The xml parsing? The
database inserts? Were you cpu bound or io bound?

What tools are you using to write this in? What OS are you on? What
version of PG? You know... just a "few" more details :-)

-Andy

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Paul 2010-10-31 16:37:28 NOT IN (NULL) ?
Previous Message Pavel Stehule 2010-10-31 10:47:19 Re: Implementing replace function