I have the following, fairly common, situation:

Category IDs: about 50,000
Document IDs: about 3,000,000
Many-to-many relationship: each document id is related to between 10 and 1000 category ids.
One query: given an input set of document ids, return the set of category ids related to any of them. (Very often this query is called with more than 500k input document ids.)
I use a custom-written data warehouse: file storage plus memory-resident "id offset" indices. A query over all 3 million document ids, returning almost all category ids, takes less than a second on a desktop machine. Abstracting from the concrete implementation, the query plan is:
1. for each input document id: look up an array by document id and retrieve the list of related category ids
1.1 for each category id in that list: look up the array slot for that category id and mark it as found
2. traverse the category array and extract the category ids marked as found
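The plan above can be sketched in a few lines of Python (the function name and sample data are invented for illustration; `doc_to_cats` stands in for the on-disk "id offset" index):

```python
# Sketch of the in-memory "mark and collect" plan described above.
def related_categories(doc_to_cats, input_doc_ids, num_categories):
    """Return the sorted category ids related to any of the input doc ids."""
    found = bytearray(num_categories)           # one "found" flag per category id
    for doc_id in input_doc_ids:                # step 1: per-document lookup
        for cat_id in doc_to_cats.get(doc_id, ()):
            found[cat_id] = 1                   # step 1.1: mark category as found
    # step 2: traverse the flag array and collect the marked ids
    return [cat_id for cat_id, flag in enumerate(found) if flag]

doc_to_cats = {1: [0, 2], 2: [2, 3], 3: [1]}
print(related_categories(doc_to_cats, [1, 2], 4))   # [0, 2, 3]
```

Marking into a flat flag array is what makes the duplicate elimination essentially free, no matter how many input ids map to the same category.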
I want to use PostgreSQL as the data storage. I have tried several data structures, testing btree, GIN and GiST indices over them, but the best performance I achieved, on a dataset 10 times smaller (10k category ids, 100k document ids, 1m relations), is still more than 5 times slower.
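The straightforward relational formulation being benchmarked is roughly the following (sketched with Python's stdlib sqlite3 only so the example is self-contained; the real tests were against PostgreSQL, and the table and column names here are illustrative, not the actual schema):

```python
# Minimal sketch of the plain relational formulation of the query,
# using an in-memory SQLite database as a stand-in for PostgreSQL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE doc_cat (doc_id INTEGER, cat_id INTEGER)")
# an index on doc_id is what a btree on the junction table gives you
con.execute("CREATE INDEX doc_cat_doc_idx ON doc_cat (doc_id)")
con.executemany("INSERT INTO doc_cat VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 20), (2, 30), (3, 40)])

def categories_for(doc_ids):
    placeholders = ",".join("?" * len(doc_ids))
    rows = con.execute(
        "SELECT DISTINCT cat_id FROM doc_cat "
        "WHERE doc_id IN (" + placeholders + ") ORDER BY cat_id",
        doc_ids).fetchall()
    return [r[0] for r in rows]

print(categories_for([1, 2]))   # [10, 20, 30]
```

With 500k input ids an IN list is not practical, of course; in the real tests the input set would be loaded into a (temporary) table and joined against the junction table instead.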
I read about PostgreSQL bitmap indexes and about "data lookup" while scanning indexes to fetch a value for the current transaction. I downloaded the latest v8.4 snapshot and compiled it, but as far as I can see there is no bitmap index in it. Maybe if I download the HEAD revision I will find them there, I don't know.
Anyway, I want to ask: has anyone faced a similar situation, and is there any way to achieve closer-to-optimal performance using PostgreSQL's functionality and extensibility?
pgsql-performance by date

From: Chavdar Kopoev
Date: 2008-11-26 15:38:07
Subject: many to many performance