Re: Slow query performance

From: "Nick Mellor" <nick(dot)mellor(dot)groups(at)pobox(dot)com>
To: "'Kevin Galligan'" <kgalligan(at)gmail(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow query performance
Date: 2008-10-31 09:13:00
Message-ID: D6E96BB28F3B418486D28CE52D9C4ED7@HAMISH
Lists: pgsql-general

Hi Kevin,

I'm not deeply knowledgeable about PostgreSQL, but my guess is that two
things are doing you in:

(1) scanning all those nulls during SELECTs (even though PostgreSQL stores
nulls efficiently, there are still tens or hundreds of billions of them)

(2) all those single-field indexes, and the work of combining their results
during queries

Both (1) and (2) make it very difficult for PG to cache effectively.

You have the advantage that your data is read-only at query time, so I'd
suggest this (fairly lengthy) experiment:

Go back to a split-table format, where you have:

- one rectangular table containing the fields that are always, or nearly
always, filled (the "Full" table). You talk about there being "some knowns".
Does that mean you know that certain combinations of two or more fields will
be selected on very frequently? If so, optimise those combinations in the
"Full" table with multi-field indexes (e.g. (state, age)), and put
single-field indexes on all the other fields in the "Full" table (see the
index sketch just below this list).

- one or more tables (one initially, see below) containing the fields that
are mostly null (the "Sparse" table.) Store the sparse data in a "depivoted"
form. Explanation follows:
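
For the "Full" table, here's a minimal sketch of what I mean (table, column
and index names are only placeholders for your real schema):

CREATE TABLE full_data (
    id     integer PRIMARY KEY,
    state  character varying,
    age    integer
    -- ... the other always-or-nearly-always-filled fields ...
);

-- a multi-field index for a combination you expect to filter on together:
CREATE INDEX full_data_state_age_idx ON full_data (state, age);

-- plus single-field indexes on the other "Full" fields, e.g.:
CREATE INDEX full_data_age_idx ON full_data (age);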

"Sparse" Table
---

Instead of (the original "sparse" table):

Id,field1,field2,field3,field4,...field500
Rec1,...
Rec2,...
...
RecM,...

store the sparse data as:

Id,fieldname,value
Rec1,field1name,field1value(Rec1)
Rec1,field2name,field2value(Rec1)
...
Rec1,field500name,field500value(Rec1)
Rec2,field1name,field1value(Rec2)
...
RecM,field500name,field500value(RecM)

I.e. one row per cell in the "sparse" table, and an Id to link to the "Full"
table.

For null values, don't store a depivoted record at all. I'd estimate this
would give you a few billion rows at most in this table.
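
In PostgreSQL that might look something like this (names are placeholders,
and I've typed the value column as CHARACTER VARYING for now, as per the
dirty experiment further down):

CREATE TABLE sparse_data (
    id         integer NOT NULL,            -- links to the "Full" table's id
    fieldname  character varying NOT NULL,  -- e.g. 'dirbtl', 'qqrq'
    value      character varying            -- the non-null cell value, as text
);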

(If anyone has a better name for this process than "depivoting", please pass
it along!)

In the depivoted table, put single-field and multi-field indexes on every
combination of Id, fieldname and value (see the SQL sketch after this list):
(Id)
(fieldname)
(value)
(Id,fieldname)
(Id,value)
(fieldname,value)
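
In SQL terms, roughly (index and table names are only placeholders, matching
the sketch above):

CREATE INDEX sparse_id_idx              ON sparse_data (id);
CREATE INDEX sparse_fieldname_idx       ON sparse_data (fieldname);
CREATE INDEX sparse_value_idx           ON sparse_data (value);
CREATE INDEX sparse_id_fieldname_idx    ON sparse_data (id, fieldname);
CREATE INDEX sparse_id_value_idx        ON sparse_data (id, value);
CREATE INDEX sparse_fieldname_value_idx ON sparse_data (fieldname, value);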

You might eventually have to keep a different depivoted table for each type
of field value (boolean, integer, character varying etc) but you could do a
dirty experiment by converting all values to CHARACTER VARYING and having a
look at query performance using the new structure before doing further work.
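
If and when you do split by type, that would just mean one such table per
value type, indexed the same way, e.g. (again, placeholder names):

CREATE TABLE sparse_int (
    id         integer NOT NULL,
    fieldname  character varying NOT NULL,
    value      integer
);

CREATE TABLE sparse_bool (
    id         integer NOT NULL,
    fieldname  character varying NOT NULL,
    value      boolean
);

-- ... and so on for the other value types ...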

Rationale
---

"Depivoting" makes sense to me because your data is so sparse. The huge
number of nulls may be causing severe aggregation bottlenecks and many cache
misses. All those indexes will absolutely kill the query performance of any
database in terms of hard disk seeks, data cache limits and aggregation
time. I'm not surprised that both MySQL and PostgreSQL struggle.

Too many indexes probably gave you diminishing returns on a table this big,
because 15G of RAM on your server is far too small a cache for so many
fields and indexes.

"Depivoting" eliminates the need to scan the huge number of nulls in the
dataset. Even if nulls are very efficiently handled in PostgreSQL, you're
talking about hundreds of billions of them, and hundreds of billions of
anything is never going to be quick. Better not to process them at all.

"De-pivoting" will (of course) eventually mean rewriting all your querying
code (an example rewrite follows below), and you'll probably need to
"rebuild" the sparse data into a wide table format at some point. But if the
result set is small, this should be a small price to pay for better SELECT
query performance.
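
To give you an idea of the rewrite, here is your example condition ("dirbtl
is not null and qqrq between 20 and 40") expressed against the depivoted
layout. Table and column names are the placeholders from the sketches above;
the CASE guards the integer cast because the value column holds text from
every field:

SELECT f.id
FROM   full_data f
WHERE  EXISTS (SELECT 1
               FROM   sparse_data s
               WHERE  s.id = f.id
               AND    s.fieldname = 'dirbtl')        -- "dirbtl is not null"
AND    EXISTS (SELECT 1
               FROM   sparse_data s
               WHERE  s.id = f.id
               AND    s.fieldname = 'qqrq'
               AND    CASE WHEN s.value ~ '^[0-9]+$'
                           THEN s.value::integer END
                      BETWEEN 20 AND 40);            -- "qqrq between 20 and 40"

Once you know which fields are queried numerically, moving them into a typed
table like sparse_int above lets you drop the CASE and the cast.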

If you want to do a quick and dirty experiment, I have an MS Access app that
depivots arbitrarily wide tables. I'd be glad to pass it along, although
there is a risk it's too small a gun for the job. But if Access manages okay
with the depivoted table, it might be worth a try. Based on 500 fields, 250M
records and 2% filled, it looks like it might depivot your table overnight,
or better. You'd finish with about 2.5 billion rows.
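
If you'd rather depivot inside PostgreSQL itself instead of Access, a sketch
along these lines should work (wide_table, field1, field2 are placeholders
for your current wide table and its sparse columns; the statement is easy to
generate from information_schema.columns):

INSERT INTO sparse_data (id, fieldname, value)
    SELECT id, 'field1', field1::text FROM wide_table WHERE field1 IS NOT NULL
UNION ALL
    SELECT id, 'field2', field2::text FROM wide_table WHERE field2 IS NOT NULL
-- ... one SELECT per sparse column, skipping the nulls ...
;

Running it as one INSERT per column rather than one giant UNION ALL would
let you commit incrementally and restart partway if something goes wrong,
and it's worth creating the indexes only after the load.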

Best wishes,

Nick

> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org [mailto:pgsql-general-
> owner(at)postgresql(dot)org] On Behalf Of Kevin Galligan
> Sent: Thursday, 30 October 2008 7:18 AM
> To: pgsql-general(at)postgresql(dot)org
> Subject: [GENERAL] Slow query performance
>
> I'm approaching the end of my rope here. I have a large database.
> 250 million rows (ish). Each row has potentially about 500 pieces of
> data, although most of the columns are sparsely populated.
>
> What I'm trying to do is, essentially, search for sub-sets of that
> data based on arbitrary queries of those data columns. The queries
> would be relatively simple ("dirbtl is not null and qqrq between 20
> and 40"). After the database is built, it is read only.
>
> So, I started with maybe 10-15 fields in a main table, as most records
> have values for those fields. Then had individual tables for the
> other values. The idea is that the percentage of rows with values
> drops off significantly after those main tables. That, and each
> individual query looks at probably 3 or 4 fields in total. The
> performance of those queries was pretty bad. It's got to join large
> numbers of values, which didn't really work out well.
>
> So, went the other direction completely. I rebuilt the database with
> a much larger main table. Any values with 5% or greater filled in
> rows were added to this table. Maybe 130 columns. Indexes applied to
> most of these. Some limited testing with a smaller table seemed to
> indicate that queries on a single table without a join would work much
> faster.
>
> So, built that huge table. Now query time is terrible. Maybe a
> minute or more for simple queries.
>
> I'm running vacuum/analyze right now (which is also taking forever, BTW).
>
> The box has 15G of RAM. I set shared_buffers to 8 or 9
> gig. My first question, what would be better to bump up to increase
> the performance? I thought that was the setting to jack up to improve
> query time or index caching, but I've read conflicting data. The full
> 15G of RAM is available.
>
> I originally had this in MySQL. Also bad performance. I understand
> how to optimize that much better, but it just wasn't cutting it.
>
> Anyway, help with tuning the settings would be greatly appreciated.
> Advice on how best to lay this out would also be helpful (I know it's
> difficult without serious detail).
>
> Thanks in advance,
> -Kevin
>

