From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Attila Nagy <bra(at)fsn(dot)hu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Postgresql as a dictionary coder backend?
Date: 2011-01-25 22:13:39
Message-ID: AANLkTi=7hNjidLpqTsurdqNcsX9ZZcPxn5gdu9puT8uR@mail.gmail.com
Lists: pgsql-general
2011/1/23 Attila Nagy <bra(at)fsn(dot)hu>:
> Hello,
>
> I'm looking for a database backend for a dictionary coder project. It would
> have three major tasks:
> - take a text corpus, get its words and substitute each word with a 64-bit
> integer (the word:integer mapping is always constant) and store the result
> (encoding)
OK. PostgreSQL allows you to do that easily.
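For instance, a minimal sketch of the dictionary table (all table and column names here are illustrative, not from the original mail):

```sql
-- Hypothetical dictionary table: one row per distinct word.
CREATE TABLE words (
    id       bigserial PRIMARY KEY,   -- the 64-bit integer code
    word     text NOT NULL UNIQUE,    -- unique index enforces word:integer constancy
    refcount bigint NOT NULL DEFAULT 0
);

-- Encoding a known word is a lookup through the unique index:
SELECT id FROM words WHERE word = 'example';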
> - take the previous result and substitute the integers with words (decoding)
Likewise.
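Decoding is then a join back against the dictionary. A sketch, assuming a hypothetical words(id, word) dictionary table and each encoded message stored one code per row:

```sql
-- Hypothetical storage of an encoded message as an ordered list of codes.
CREATE TABLE message_words (
    message_id bigint,
    position   integer,
    word_id    bigint   -- references words.id
);

-- Decoding: replace each code with its word, preserving word order.
SELECT w.word
FROM message_words mw
JOIN words w ON w.id = mw.word_id
WHERE mw.message_id = 1
ORDER BY mw.position;
```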
> - the words should be reference counted, so if a word can no longer be found
> in any of the encoded messages, delete it (and optionally free its integer
> ID, but 64 bits is believed to be enough for a long time, although having
> smaller IDs results in smaller encoded files). This could be achieved by
> informing the database of the words of a deleted message, so it could
> decrement those refcounts and delete the records if needed.
Yes, like what depesz does:
http://www.depesz.com/index.php/2009/07/10/getting-list-of-unique-elements/
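A sketch of that delete path, assuming hypothetical words(id, word, refcount) and message_words(message_id, word_id) tables:

```sql
-- Decrement refcounts for the words of a deleted message,
-- then drop any word that is no longer referenced anywhere.
UPDATE words
SET refcount = refcount - 1
WHERE id IN (SELECT word_id FROM message_words WHERE message_id = 1);

DELETE FROM words WHERE refcount <= 0;

DELETE FROM message_words WHERE message_id = 1;
```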
>
> I can easily do this with any RDBMS, with a table of three columns: auto
> incremented ID, word and refcount, with a unique index on word.
> The challenge could be:
> - that it should scale to several TBs of size and several (hundred) billion
> records. One scenario would be to store about 40 TBs of words and the
> average word length would be about 50-60 bytes (that's about 800*10^9
> records). It should work well both for inserting and searching (encoding and
> decoding) words.
I strongly suggest you have a look at the intarray contrib module (it
is provided with PostgreSQL).
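For example (a sketch; note that intarray operates on 4-byte integer arrays, so it only applies if the word IDs fit in 32 bits):

```sql
-- intarray ships as a contrib module (CREATE EXTENSION in 9.1+).
CREATE EXTENSION intarray;

-- Store each encoded message as a single integer array instead of
-- one row per word (hypothetical table, illustrative names).
CREATE TABLE messages (
    message_id bigint PRIMARY KEY,
    codes      int[]   -- word IDs, in message order
);

-- intarray's uniq(sort(...)) yields the distinct word IDs of a
-- message, which is exactly what refcount bookkeeping needs.
SELECT uniq(sort(codes)) FROM messages WHERE message_id = 1;
```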
> - I need atomicity and durability, but having these at a word (record) level
> takes too many IOPS and serves no purpose, so it would be good to have an
> interface for inserting about 1000-500000 words in one call, assign a unique
> ID to each unique word and store them (if the word already has an ID,
> increment its refcount) and give back the ID for each word. This
> transaction could be committed as one, so the transactions could be big,
> sparing IOPS.
Arrays allow very good compression of the data per row. (Using arrays
this way is not the classic relational approach, but it is good for
performance.)
> - I need concurrency, so when the above happens from two sources at the same
> time, the same word in the two transactions must get the same ID
One transaction will finish before the other, which guarantees that
(but they can start at the same time).
>
> Is postgresql a good choice for doing this and if yes, what would be the
> optimal (for both time and space efficiency at encoding and decoding) use
> case?
PostgreSQL should work for that, yes. You'll have to compensate for the
size with good hardware and good SQL (and probably some optimizations,
such as using arrays).
--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support