FW: Slow query performance

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: FW: Slow query performance
Date: 2008-10-29 22:38:25
Message-ID: D425483C2C5C9F49B5B7A41F89441547010011E7@postal.corporate.connx.com
Lists: pgsql-general

From: Kevin Galligan [mailto:kgalligan(at)gmail(dot)com]
Sent: Wednesday, October 29, 2008 3:16 PM
To: Dann Corbit
Subject: Re: [GENERAL] Slow query performance

Columns are as follows:

account | integer |
city | character varying(20) |
zip | character(5) |
dincome | character(1) |
sex | character(1) |
mob | boolean |
religion | character(1) |
groupcd | character(1) |
lastdata | character varying(4) |
countycd | character varying(3) |
state | character varying(2) |
dutype | character varying(1) |
orders | integer |
countysz | character varying(1) |
ethnic | character varying(2) |
language | character varying(2) |
cbsacode | character varying(5) |
cbsatype | character varying(1) |
age | smallint |
dob | date |
ccard | boolean |
lor | integer |
bankcard | boolean |
lastord | date |
total | integer |
lmob | boolean |
homeown | character varying(1) |
ord1st | date |
ordlast | date |
married | boolean |
deptcard | boolean |
>>

You did not show us the indexes.

If you have pgadmin III, go to the table and copy/paste the actual
definition, including indexes.
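
Or from psql, something along these lines will show the same thing (a
rough sketch, using the bigdatatable name from your query below):

\d bigdatatable
-- or just the index definitions:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'bigdatatable';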

<<
From here it's about another 100 columns with either booleans or
smallints, mostly null values.

I eventually killed the vacuum. I will run it again, but was just going
through the indexes. All were of this format...

"INFO: index "idx_jordlast" now contains 265658026 row versions in
728409 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.88s/0.13u sec elapsed 90.38 sec."

An example of a slow query is...

select count(*) from bigdatatable where age between 22 and 40 and state
= 'NY';
>>

Is there an index on state and age?

Is there an index on state?

Is there an index on age?

That is important missing information.

If there is no index on either column, then you will do a table scan.

If all of your slow queries look like the above, then create a clustered
index on (state, age).
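
In PostgreSQL terms that would be roughly the following (the index name
is made up; CLUSTER physically reorders the table once rather than
maintaining a clustered index, which is fine for a read-only table):

CREATE INDEX idx_state_age ON bigdatatable (state, age);
-- 8.3 syntax; on older releases: CLUSTER idx_state_age ON bigdatatable
CLUSTER bigdatatable USING idx_state_age;
ANALYZE bigdatatable;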

<<
I know count(*) is not optimized on PostgreSQL like it is on MySQL due to
transaction isolation (at least that's what I've read; makes sense to
me). I understand it'll take time to actually count the rows. However,
here's the output of 'explain analyze select count(*) from bigdatatable
where age between 22 and 40 and state = 'NY';'

Aggregate  (cost=5179639.55..5179639.56 rows=1 width=0) (actual time=389529.895..389529.897 rows=1 loops=1)
  ->  Bitmap Heap Scan on bigdatatable  (cost=285410.65..5172649.63 rows=2795968 width=0) (actual time=6727.848..387159.175 rows=2553273 loops=1)
        Recheck Cond: ((state)::text = 'NY'::text)
        Filter: ((age >= 22) AND (age <= 40))
        ->  Bitmap Index Scan on idx_jstate  (cost=0.00..284711.66 rows=15425370 width=0) (actual time=6298.950..6298.950 rows=16821828 loops=1)
              Index Cond: ((state)::text = 'NY'::text)
Total runtime: 389544.088 ms

It looks like the index scan is around 6 seconds or so, which is fine.
Then it jumps up to about 6.5 minutes by the time the bitmap heap scan
and the Aggregate finish.

I know the database design is crude. It's really just a big flat table.
I didn't put too much into weeding out which columns should be indexed
and which shouldn't (just slapped an index on each). Happy to do that
work, but right now I'm in panic mode and just need to figure out which
way to start going. I had a design on MySQL which worked pretty well at
10 to 20% of full size, but degraded quite a bit at full size.
Compounding this, there is another implementation we've seen that uses
the full size of similar data and returns actual results in seconds. (I
originally planned to use a 5% size db for estimated results, then the
full size for getting the actual data. That plan was rejected :( )
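
For the weeding-out work, something like pg_stat_user_indexes should
show which indexes actually get used, so the never-scanned ones can be
dropped (rough sketch; table name assumed to be bigdatatable):

SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'bigdatatable'
ORDER BY idx_scan;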

Any thoughts? It seemed to work OK when I had a table with 10 cols but
about the same data length. That may have been an artificial test,
though.

Again, this is read-only once the data is set up. The client wants to run
pretty much arbitrary queries, so it's hard to isolate certain things for
optimization, although there are some "knowns".

Will start the full vacuum process again.
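
Probably along these lines (the maintenance_work_mem bump is just a
guess given the 15 GB of RAM, to speed up the index cleanup phase):

SET maintenance_work_mem = '1GB';
VACUUM VERBOSE ANALYZE bigdatatable;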

Thanks in advance,
-Kevin

On Wed, Oct 29, 2008 at 4:52 PM, Dann Corbit <DCorbit(at)connx(dot)com> wrote:
>> -----Original Message-----
>> From: pgsql-general-owner(at)postgresql(dot)org [mailto:pgsql-general-
>> owner(at)postgresql(dot)org] On Behalf Of Kevin Galligan
>> Sent: Wednesday, October 29, 2008 1:18 PM
>> To: pgsql-general(at)postgresql(dot)org
>> Subject: [GENERAL] Slow query performance
>>
>> I'm approaching the end of my rope here. I have a large database.
>> 250 million rows (ish). Each row has potentially about 500 pieces of
>> data, although most of the columns are sparsely populated.
>>
>> What I'm trying to do is, essentially, search for sub-sets of that
>> data based on arbitrary queries of those data columns. The queries
>> would be relatively simple ("dirbtl is not null and qqrq between 20
>> and 40"). After the database is built, it is read only.
>>
>> So, I started with maybe 10-15 fields in a main table, as most
>> records have values for those fields. Then had individual tables for
>> the other values. The idea is that the percentage of rows with values
>> drops off significantly after those main tables. That, and each
>> individual query looks at probably 3 or 4 fields in total. The
>> performance of those queries was pretty bad. It's got to join large
>> numbers of values, which didn't really work out well.
>>
>> So, went the other direction completely. I rebuilt the database with
>> a much larger main table. Any column filled in for 5% or more of the
>> rows was added to this table. Maybe 130 columns. Indexes applied to
>> most of these. Some limited testing with a smaller table seemed to
>> indicate that queries on a single table without a join would work
>> much faster.
>>
>> So, built that huge table. Now query time is terrible. Maybe a
>> minute or more for simple queries.
>>
>> I'm running vacuum/analyze right now (which is also taking forever,
>> BTW).
>>
>> The box has 15 GB of RAM. I set shared_buffers to 8 or 9
>> GB. My first question: what would be better to bump up to increase
>> performance? I thought that was the setting to jack up to improve
>> query time or index caching, but I've read conflicting data. The full
>> 15 GB of RAM is available.
>>
>> I originally had this in MySQL. Also bad performance. I understand
>> how to optimize that much better, but it just wasn't cutting it.
>>
>> Anyway, help with tuning the settings would be greatly appreciated.
>> Advice on how best to lay this out would also be helpful (I know it's
>> difficult without serious detail).
>
> Show the full schema for your table and the output of:
> VACUUM VERBOSE ANALYZE <your_table>
>
> Show the query that is slow.
>
>
