Logarithmic data frequency distributions and the query planner

From: Jerry Gamache <jerry(dot)gamache(at)idilia(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Logarithmic data frequency distributions and the query planner
Date: 2010-07-07 20:54:48
Message-ID: 4C34E998.3020000@idilia.com
Lists: pgsql-performance

On PostgreSQL 8.1, I have a very interesting database where the
distribution of some values in a multi-million-row table is logarithmic
(i.e. the most frequent value is an order of magnitude more frequent
than the next ones). If I analyze the table, the statistics become
extremely skewed towards the most frequent values, and this prevents the
planner from producing good plans for queries that do not target these
entries.
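
To make the shape of such a distribution concrete, here is a minimal
sketch. It assumes a purely geometric model in which every value is ten
times less frequent than the one before it; the ratio of 10 and the
count of 6 distinct values are illustrative assumptions, not figures
from the actual table.

```python
# Illustrative model only: the ratio and the number of distinct values
# are assumptions, not measurements from the table described above.
def frequency_shares(n_values, ratio=10.0):
    """Return the share of rows held by each value when every value is
    `ratio` times less frequent than the previous one."""
    weights = [ratio ** -i for i in range(n_values)]
    total = sum(weights)
    return [w / total for w in weights]


shares = frequency_shares(6)
for rank, share in enumerate(shares, start=1):
    print(f"value #{rank}: {share:.6f} of rows")
```

Under this model the top value alone holds about 90% of the rows, so
any estimate dominated by it says little about the remaining values.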

In a recent case, the planner estimated that the number of returned rows
would be ~13% of the table size and, from this bad assumption, generated
a very slow, conservative plan that took days to execute. If I set the
statistics to zero for that table, the planner uses a hardcoded ratio
(it seems to be 0.5%) for the number of returned rows, and this helps it
generate a plan that executes in 3 minutes (still sub-optimal, but not
as bad).
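
The gap between the two estimates is easier to see as row counts. The
sketch below is simple arithmetic on the two ratios quoted above; the
table size of 5,000,000 rows is a hypothetical stand-in, since the
table is only described as having several million rows.

```python
# Hypothetical table size: the post only says "multi-million rows".
TABLE_ROWS = 5_000_000

ANALYZED_FRACTION = 0.13   # ~13% estimate seen after analyzing the table
DEFAULT_FRACTION = 0.005   # ~0.5% ratio seen with the statistics set to zero


def estimated_rows(table_rows, fraction):
    """Row count the planner would expect for a given selectivity."""
    return round(table_rows * fraction)


with_stats = estimated_rows(TABLE_ROWS, ANALYZED_FRACTION)
without_stats = estimated_rows(TABLE_ROWS, DEFAULT_FRACTION)
ratio = with_stats / without_stats
print(with_stats, without_stats, ratio)
```

Whatever the real table size, the first estimate is 26 times larger
than the second, which is enough to push a planner toward a very
different plan.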

Creating partial indexes for the less frequent cases helps, but this
solution is not flexible enough for our needs as the number of complex
queries grows. We are mostly left with pre-generating a lot of temporary
tables whenever the planner overestimates the number of rows generated
by a subquery (query execution was trimmed from 3 minutes to 30 seconds
using this technique) or using the OFFSET 0 tweak, but it would be nice
if the planner could handle this on its own.

Am I missing something obvious? Setting the statistics for this table to
zero seems awkward even if it gives good results.
Jerry.
