OT: Data structure design question: How do they count so fast?

From: Brendan Duddridge <brendan(at)clickspace(dot)com>
To: PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: OT: Data structure design question: How do they count so fast?
Date: 2006-04-09 05:49:10
Message-ID: A8E4EC08-DDA4-4262-9CA6-0C9F65C9793A@clickspace.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

First of all, the reason I'm posting on the PostgreSQL Performance
list is we have a performance issue with one of our applications and
it's related to the speed at which PostgreSQL can do counts. But it's
also related to the data structure we've designed to develop our
comparison shopping engine. It's very normalized and the speed of our
queries are slowing down considerably as we add more and more data.

So I've been looking at some of the other comparison shopping engines
and I'm trying to figure out how they manage to get the counts of
their products for a set of attributes and attribute values so quickly.

For example, on the following page, they have 1,260,658 products:

http://www.mysimon.com/Home-Furnishings/9000-10975_8-0.html?tag=glnav

They display 3 attributes with values on the page: Price Range, Home
Furnishing Type, and Store. Plus there are a selection of other
attributes not displaying the value choices.

For Price Range, they have the following values and product counts
(in brackets):

# Below $20 (204,315)
# $20 - $50 (234,694)
# $50 - $80 (188,811)
# $80 - $130 (182,721)
# $130 - $240 (222,519)

For Home Furnishing Type they have:

# Wall Art and Decor (438,493)
# Lighting (243,098)
# Bathroom Furnishings (132,441)
# Rugs (113,216)
# Decorative Accents (65,418)

And for Store they have:

# Art.com (360,933)
# HomeAnnex (130,410)
# AllPosters.com (72,529)
# HomeClick.com (61,423)
# 1STOPlighting Superstore (32,074)

Now, initially I thought they would just pre-compute these counts,
but the problem is, when you click on any of the above attribute
values, they reduce the remaining possible set of matching products
(and set of possible remaining attributes and attribute values) by
the amount displayed next to the attribute value selected. You can
click on any combination of attribute values to filter down the
remaining set of matching products, so there's a large combination of
paths you can take to arrive at a set of products you might be
interested in.

Do you think they are pre-computed? Or do you think they might use a
query similar to the following?:

select pav.attribute_value_id, count(p.product_id)
from product_attribute_value pav,
attribute a,
product p
where a.attribute_id in (some set of attribute ids) and
pav.product_id = p.product_id and
pav.attribute_id = a.attribute_id and p.product_id in
(select product_id
from category_product
where category_id = some category id) and
p.is_active = 'true'
group by pav.attribute_value_id;

It would seem to me that although the above query suggests a
normalized database structure, that joining with 3 tables plus a 4th
table in the sub-query with an IN qualifier and grouping to get the
product counts would take a VERY long time, especially on a possible
result set of 1,260,658 products.

The other issue is trying to figure out what the remaining set of
possible attribute and attribute values are in order to reach the
remaining set of products.

Does anyone have any insights into what kind of data structures would
be necessary to accomplish such a feat? I know that PostgreSQL has
performance issues with counting rows, so I can't imagine being able
to use the above kind of query to get the results we need. I don't
know what kind of database backend mysimon is using either. It would
also seem to be logical that having a flattened data structure would
seem to be necessary in order to get the performance required. Their
pages seem to load pretty fast.

We are possibly considering some kind of pre-computed decision-tree
type data structure to get the counts of the set of products that
could be reached by selecting any combination of attributes and
attribute values. Does this seem like a reasonable idea?

Perhaps there's a better way?

I also apologize if this isn't the appropriate list to publish this
kind of question to. The reason I posted here was because it is a
performance related question, but not necessarily directly related to
PostgreSQL. It's just that we are using PostgreSQL for our product
comparison engine so I thought there could be some PostgreSQL
specific optimizations that could be made. If not, please let me know
and I'll move it elsewhere.

Thanks very much,

____________________________________________________________________
Brendan Duddridge | CTO | 403-277-5591 x24 | brendan(at)clickspace(dot)com

ClickSpace Interactive Inc.
Suite L100, 239 - 10th Ave. SE
Calgary, AB T2G 0V9

http://www.clickspace.com

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Chethana, Rao (IE10) 2006-04-09 09:27:57 pls reply ASAP
Previous Message Bruce Momjian 2006-04-09 03:26:51 Re: Indexes with descending date columns