Re: Postgresql as a dictionary coder backend?

From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Attila Nagy <bra(at)fsn(dot)hu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Postgresql as a dictionary coder backend?
Date: 2011-01-25 22:13:39
Message-ID: AANLkTi=7hNjidLpqTsurdqNcsX9ZZcPxn5gdu9puT8uR@mail.gmail.com
Lists: pgsql-general

2011/1/23 Attila Nagy <bra(at)fsn(dot)hu>:
>  Hello,
>
> I'm looking for a database backend for a dictionary coder project. It would
> have three major tasks:
> - take a text corpus, get its words and substitute each word with a 64-bit
> integer (the word:integer mapping is always constant) and store the result
> (encoding)

OK, PostgreSQL allows you to do that easily.
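
As a minimal sketch, something along these lines would do (the table and
column names here are only placeholders):

    CREATE TABLE words (
        id       bigserial PRIMARY KEY,      -- the 64-bit code assigned to the word
        word     text      NOT NULL UNIQUE,  -- the unique index is also the encode lookup
        refcount bigint    NOT NULL DEFAULT 0
    );

    -- encoding: look a word up; if it is not there yet, insert it first
    SELECT id FROM words WHERE word = 'example';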

> - take the previous result and substitute the integers with words (decoding)

Same thing: PostgreSQL handles that just as easily.
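
Decoding is just a join back to the same table. For example, if an encoded
message is kept as a bigint[], something like this works (the ids are made
up, and WITH ORDINALITY needs PostgreSQL 9.4 or later):

    -- decoding: turn an array of ids back into words, preserving their order
    SELECT w.word
    FROM unnest(ARRAY[42, 17, 42]::bigint[]) WITH ORDINALITY AS u(word_id, pos)
    JOIN words w ON w.id = u.word_id
    ORDER BY u.pos;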

> - the words should be reference counted, so if a word can no longer be found
> in any of the encoded messages, delete it (and optionally free its integer
> ID, but 64 bits are believed to be enough for a long time, although having
> smaller IDs results in smaller encoded files). This could be achieved by
> informing the database of the words of a deleted message, so it could
> decrement those refcounts and delete the records if needed.

Yes, much like what depesz describes here:
http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/
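
In SQL terms, the bookkeeping when a message goes away could be as simple as
this sketch (the array literal stands in for the word ids of the deleted
message):

    -- when a message is deleted, decrement the refcount of each word it used
    UPDATE words
       SET refcount = refcount - dead.cnt
      FROM (SELECT word_id, count(*) AS cnt
              FROM unnest(ARRAY[42, 17, 42]::bigint[]) AS word_id
             GROUP BY word_id) dead
     WHERE words.id = dead.word_id;

    -- then reap the words that are no longer referenced anywhere
    DELETE FROM words WHERE refcount <= 0;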

>
> I can easily do this with any RDBMS, with a table of three columns: an
> auto-incremented ID, word and refcount, with a unique index on word.
> The challenge could be:
> - that it should scale to several TBs in size and several (hundred) billion
> records. One scenario would be to store about 40 TB of words with an average
> word length of about 50-60 bytes (that's about 800*10^9 records). It should
> work well both for inserting and searching (encoding and decoding) words.

I strongly suggest you have a look at the intarray contrib module (it
ships with PostgreSQL).
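
Note that intarray works on integer[] (4-byte) arrays, so it only applies if
your IDs fit in 32 bits. Within that limit, a rough sketch of what it buys
you when the encoded messages are stored as arrays (table and index names are
made up):

    CREATE EXTENSION intarray;  -- on releases before 9.1, run the contrib SQL script

    CREATE TABLE messages (
        id       bigserial PRIMARY KEY,
        word_ids integer[] NOT NULL        -- the encoded text
    );

    CREATE INDEX messages_word_ids_idx
        ON messages USING gin (word_ids gin__int_ops);

    -- find every message containing a given word id
    SELECT id FROM messages WHERE word_ids @> ARRAY[42];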

> - I need atomicity and durability, but having these on a word (record) level
> takes too many IOPS and is of no use, so it would be good to have an
> interface for inserting about 1000-500000 words in one call, assigning a
> unique ID to each unique word and storing them (if the word already has an
> ID, increment its refcount), and giving back the IDs for each word. This
> transaction could be committed as one, so the transactions could be big,
> sparing IOPS.

Arrays allow very good compression of the data per row. (Using arrays
for that is not the purely relational way of doing things, but it is
good for performance.)
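
For the batched interface you describe, a single statement can take a whole
batch, create the missing words, bump the refcounts of the known ones and
hand every id back in one transaction. A sketch (INSERT ... ON CONFLICT only
exists in PostgreSQL 9.5 and later; on older releases the same thing can be
done with a PL/pgSQL upsert loop like the one sketched below):

    -- the GROUP BY matters: ON CONFLICT must not see the same word twice
    WITH batch(word, cnt) AS (
        SELECT word, count(*)
        FROM unnest(ARRAY['the', 'quick', 'the']::text[]) AS word
        GROUP BY word
    )
    INSERT INTO words (word, refcount)
    SELECT word, cnt FROM batch
    ON CONFLICT (word) DO UPDATE
        SET refcount = words.refcount + EXCLUDED.refcount
    RETURNING id, word;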

> - I need concurrency, so when the above happens from two sources at the same
> time, the same word in the two transactions must get the same ID

One transaction will finish before the other to guarantee that (but
they can start at the same time).
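
The unique index on word is what gives you that guarantee: whichever
transaction inserts a given word first wins, and a concurrent insert of the
same word blocks until the first transaction finishes, then raises a
unique_violation that can simply be caught and retried. A version-independent
sketch, loosely modeled on the upsert example in the PostgreSQL documentation
(the function name is made up):

    CREATE OR REPLACE FUNCTION get_word_id(p_word text) RETURNS bigint AS $$
    DECLARE
        v_id bigint;
    BEGIN
        LOOP
            -- cheap path: the word is already in the dictionary
            SELECT id INTO v_id FROM words WHERE word = p_word;
            IF FOUND THEN
                UPDATE words SET refcount = refcount + 1 WHERE id = v_id;
                RETURN v_id;
            END IF;
            BEGIN
                INSERT INTO words (word, refcount) VALUES (p_word, 1)
                RETURNING id INTO v_id;
                RETURN v_id;
            EXCEPTION WHEN unique_violation THEN
                -- another transaction inserted the same word first:
                -- loop again and reuse its id
            END;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;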

>
> Is PostgreSQL a good choice for doing this and, if so, what would be the
> optimal (for both time and space efficiency at encoding and decoding) use
> case?

PostgreSQL should work for that, yes. You'll have to compensate for the
size with good hardware and good SQL (and probably some optimizations,
like using arrays).

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
