Re: What is the best way to do attribute/values?

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Daniel Ceregatti" <vi(at)sh(dot)nu>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: What is the best way to do attribute/values?
Date: 2004-08-26 00:38:16
Message-ID: NOEFLCFHBPDAFHEIPGBOIEFHCDAA.simon@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


Two more unusual suggestions:

1. Drop all the indexes and just do sequential scans (full table scans),
aiming as hard as possible to get the whole people table to fit in memory
(size says that should be easy - various ways) - and make sure you're using
8.0 so you have the best cache manager. This will at least give you
consistent performance on whatever attribute values searched on in user
queries. Dropping all the indexes will allow the query to optimize faster,
since it has only one path choice. Work out how many attributes it takes to
reduce the list of candidates to a manageable number, and include only those
factors into the table, effectively vertically partitioning the table,
thereby reducing the volume and increasing scan speed. Then redesign the
user interface so that they see a two-stage process, first stage is top N
common attributes, second stage is to further reduce that down using rarer
attributes, as well as using the output from the first table to index into
the second. Users then won't mind additional wait time.
(Experiment with: Horizontally partition the table into N pieces. Issue N
simultaneous queries to simulate a parallel query. Try N == 2 on your box)

2. You can try forcing the use of a Star Join optimization here:
Concatenate the attribute values into a single column, then index it. This
will be nearly unique. Cluster the table.
Permute the values of the attributes, to give you a list of concatenated
keys that would match, then join that list to the main table, using a join
via the index.
You can do this permutation by using a reference table per attribute, then
doing an unconstrained product join between all of the attribute tables
(avoid any indexes on them) and assembling the candidate keys into a single
temporary table. Then join the temp table to the main people table. This
will only work effectively if people's attributes are selected with some
discrimination, otherwise this optimisation will fail. You'd need to
constrain the user interface to "pick 20 out of the following 100 attribute
values" or some other heuristic to ensure a relatively low count, or use a
LIMIT on the query into the temp table.
This sounds long winded, but is essentially the route the Oracle optimizer
takes in performing a star join....you clearly know you're Oracle, so look
that up to confirm what I'm saying. (May not work as well if you use a
sub-select on PostgreSQL....)

Also, I'd categorise the Age, Height, Weight and Salary attributes and
everything else based upon most common ranges, so it will be just an
equality search on an integer assigned to that category, rather than a >
search. Order by the distance, don't search on it, it'll be quicker since
you'll only need to calculate it for the records that match...even if you do
get a few too many, it would be a shame to avoid somebody because they lived
1 mile outside of the stated radius.

The database sounds < 1 Gb in total logical volume, so 4Gb of RAM should be
easily sufficient.

Best Regards, Simon Riggs

> -----Original Message-----
> From: pgsql-performance-owner(at)postgresql(dot)org
> [mailto:pgsql-performance-owner(at)postgresql(dot)org]On Behalf Of Daniel
> Ceregatti
> Sent: 19 August 2004 19:03
> To: pgsql-performance(at)postgresql(dot)org
> Subject: [PERFORM] What is the best way to do attribute/values?
>
>
> Hi list,
>
> I have a database with 1M "people" in it. Each person has about 20
> attributes, such as height, weight, eye color, etc. I need to be able to
> search for people based on these attributes. A search can be conducted
> on one attribute, all attributes, or any number in between. How would
> _you_ do this?
>
> I have already attempted to answer this. My attempts are detailed here:
>
> http://sh.nu/email.txt
>
> This is the email I was originally going to send to this list. Since
> it's so large, I decided to link to it instead. If you feel that it
> belongs in a post to the list, let me know, and I'll post again.
>
> I've discussed these attempts with people in #postgresql on
> irc.freenode.net. Agliodbs (I presume you know who this is) was very
> helpful, but in end was at a loss. I find myself in the same postition
> at this time. He suggested I contact this list.
>
> My ultimate goal is performance. This _must_ be fast. And by fast, I
> mean, < 1 second, for every permutation of the number of attributes
> searched for. Flexibility would be a bonus, but at this point I'll
> settle for something that's harder to maintain if I can get the speed
> gain I need.
>
> Thanks,
>
> Daniel Ceregatti
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Christopher Kings-Lynne 2004-08-26 01:16:17 Re: Equivalent praxis to CLUSTERED INDEX?
Previous Message Josh Berkus 2004-08-25 23:50:53 Re: TSearch2 and optimisation ...