Table Design question for gurus (without going to "NoSQL")...

From: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
To: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Table Design question for gurus (without going to "NoSQL")...
Date: 2011-11-20 13:33:47
Message-ID: CAFWfU=sS4C1-FT=EyxjKAtWzqoxMHrDL7+o_unWDbUB9kNwQFg@mail.gmail.com
Lists: pgsql-general

Hi.

I want to start another thread, loosely related to the performance
problems thread I already have going.

I need some DB design guidance from the gurus here.

My big table now has about 70 million rows, with the following columns:

alias | character varying(35)
url | text
modify_date | timestamp without time zone
ip | bigint

For each IP address (a user of my application) I want every URL to
be unique. So I used to have a UNIQUE constraint on (ip, url). But
the index based on this became huge, as some URLs are gigantic, so I
introduced an md5 of the URL:

url_md5 | varchar(32)
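
For reference, here is the full table as it now stands, written out
as DDL (a sketch: the table name "links" is my placeholder, the
columns are as above):

CREATE TABLE links (
    alias       varchar(35),
    url         text,
    modify_date timestamp without time zone,
    ip          bigint,
    url_md5     varchar(32)  -- md5(url), so indexes need not cover huge URLs
);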

I now have two scenarios:

1. To have an index (unique?) on "(ip, url_md5)"

2. To have an index on just "ip" (no composite index). This way a
query that tries to match "...WHERE ip = 999 AND url_md5 = '<md5
here>'..." will still use the index only for the ip part, then
refine the result with the url_md5.

The good thing about #2 is that the index stays very small, since
only a single bigint column (ip) is indexed.

The bad thing about #2 is that each query of "...WHERE ip = 999 AND
url_md5 = '<md5 here>'..." has to check url_md5 against every row
matching the indexed IP. If one IP address has put in a lot of URLs,
this becomes slow, as is now happening: I have users with over 1
million URLs each!
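
To make the two scenarios concrete, the DDL would look roughly like
this (a sketch; the index and table names are mine):

-- Scenario 1: one composite (unique?) index covering both columns.
CREATE UNIQUE INDEX links_ip_md5_idx ON links (ip, url_md5);

-- Scenario 2: index only the ip; url_md5 is then checked row by
-- row against the table for every row that matches the ip.
CREATE INDEX links_ip_idx ON links (ip);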

Questions:

1. Instead of md5, is there an integer hashing scheme that would let
me use a bigint column, and hopefully save a lot in both storage
space and speed? (Some very useful points are mentioned here:
http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer
)
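
For example (my own sketch, not benchmarked): in PostgreSQL you can
keep just the first 64 bits of the md5 as a bigint by casting
through bit(64):

-- First 16 hex digits = first 64 bits of the md5. Collisions are
-- more likely than with the full 128-bit md5, so a rare false match
-- may need a recheck against the full url.
SELECT ('x' || substr(md5('http://example.com/some/long/url'), 1, 16))::bit(64)::bigint;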

2. If I do go with scenario #1 above, the composite index, is there
any way I can save space while maintaining speed? Partitioning etc.
is out of the question.
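
One space-saving idea for scenario #1 (my suggestion, untested
here): an md5 is exactly 128 bits, so it can live in a uuid column,
16 bytes instead of the ~33 bytes a varchar(32) occupies, which
shrinks the composite index accordingly:

-- PostgreSQL's uuid type is a fixed 16-byte binary value, and its
-- input routine accepts 32 hex digits without hyphens, i.e. raw
-- md5() output.
ALTER TABLE links ADD COLUMN url_md5_u uuid;
UPDATE links SET url_md5_u = url_md5::uuid;
CREATE UNIQUE INDEX links_ip_md5_u_idx ON links (ip, url_md5_u);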

With a growing web database, I am sure many people face this
situation. Are NoSQL-type databases the only sane solution to such
massive volumes and throughput expectations (e.g., CouchDB's
Membase)?

Many thanks for any ideas or pointers!
