Re: more than 2GB data string save

From: Allan Kamau <kamauallan(at)gmail(dot)com>
To: Steve Atkins <steve(at)blighty(dot)com>
Cc: pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: more than 2GB data string save
Date: 2010-02-10 07:34:03
Message-ID: ab1ea6541002092334g7ef631b7p3a923d7266fe6d00@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Feb 10, 2010 at 10:11 AM, Steve Atkins <steve(at)blighty(dot)com> wrote:
>
> On Feb 9, 2010, at 10:38 PM, AI Rumman wrote:
>
>> Thanks for your quick answes.
>>
>> But if I use a file and then store the name in the database, is it possible to use TEXT search tsvector and tsquery indexing on these external files?
>
> No, no way at all.
>
> A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database it's way bigger than the underlying index types tsquery uses are designed for.
>
> Are you sure that the documents are that big? A single document of that size would be 400 times the size of the bible. That's a ridiculously large amount of text, most of a small library.
>
> If the answer is "yes, it's really that big and it's really text" then look at clucene or, better, hiring a specialist.
>
> Cheers,
>  Steve
>
>>
>> On Wed, Feb 10, 2010 at 12:26 PM, Steve Atkins <steve(at)blighty(dot)com> wrote:
>>
>> On Feb 9, 2010, at 9:52 PM, Scott Marlowe wrote:
>>
>> > On Tue, Feb 9, 2010 at 9:38 PM, AI Rumman <rummandba(at)gmail(dot)com> wrote:
>> >> How to save 2 GB or more text string in Postgresql?
>> >> Which data type should I use?
>> >
>> > If you have to you can use either the lo interface, or you can use
>> > bytea.  Large Object (i.e. lo) allows for access much like fopen /
>> > fseek  etc in C, but the actual data are not stored in a row with
>> > other data, but alone in the lo space.  Bytea is a legit type that you
>> > can have as one of many in a row, but you retrieve the whole thing at
>> > once when you get the row.
>>
>> Bytea definitely won't handle more than 1 GB. I don't think the lo interface
>> will handle more than 2GB.
>>
>> >
>> > Preferred way to store 2GB data is to put it into a file and put the
>> > name of the file into the database.
>>
>>
>> This.
>>
>> Cheers,
>>  Steve
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

I have some simple (and possibly wrong) solutions are presented below.
a) Store only the file name in the db and leave the file on the file
system where you can access it after after reading in the file name
from the DB to determine it's location
This solution may offer the possibility of using many of the very fast
and efficient file content search and manipulation algorithms and
tools that are part of most OSes (for instance grep, sed, ...)

b) If you really need to store the file contents in the DB. You may
write a small method which reads in the file contents in batches of n
number of bytes and writes each of these batches in a new table row,
but one thing to remember is to capture and store the batch number
(which should be sequencial) in another field of the respective batch.
And finally indicate the last batch by maybe setting another field in
the row as 'true'.
Doing so will provide you with the opportunity to parallize your the
batch persistance to DB and still have a way to correctly put together
the sections of the file contents (the batches) while reconstructing
the file.
Also will ensure the size of your transaction will be as small the
size of a batch.

Allan.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message John R Pierce 2010-02-10 07:35:43 Re: dump of 700 GB database
Previous Message Steve Atkins 2010-02-10 07:30:40 Re: more than 2GB data string save