Re: Yet another abort-early plan disaster on 9.3

From: Craig James <cjames(at)emolecules(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 14:21:05
Message-ID: CAFwQ8rd0C47ZkCGONPa1UfJg2nJzyi-7kAQ6-jjapV+GU-Mo0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

On Fri, Oct 10, 2014 at 5:10 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> > I've gone looking for papers on this topic but from what I read this
> > isn't so. To get any noticeable improvement you need to read 10-50% of
> > the table and that's effectively the same as reading the entire table
> > -- and it still had pretty poor results. All the research I could find
> > went into how to analyze the whole table while using a reasonable
> > amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some basic
> agreement on what are the goals. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed and
> not scaling with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results looks really interesting, IMHO. So far I was testing it on
> synthetic datasets outside the database, but I plan to use it instead of
> our estimator, and do some more tests.
>

We've solved this problem using an external (non-Postgres) dynamically
optimizing index. In addition to the "early abort," we also require an
efficient "late start", the equivalent of "offset 100 limit 10". It's a
common problem for web sites that let users page through data with just a
tiny amount of state information (a cookie).

Our index is for chemical structures. Chemicals are indexed on chemical
fragments <http://emolecules.com/info/molecular-informatics>. A search
typically starts with 50-200 indexed "columns" (chemical fragments). The
query is always flat, "A and B and ... and Z". The indexed fragments are
both correlated (the existence of one strongly raises the chances of
another) and anti-correlated (certain combinations are very rare).

The dynamic optimizer watches the performance of each index in real time.
It promotes highly selective indexes and demotes or removes redundant
indexes. In a typical query, the initial 50-200 indexes are reduced to 5-10
indexes within the first 100-200 rows examined. The remaining indexes have
little correlation yet retain most of the selectivity. (One critical factor
with a dynamic optimizer is that the data must be randomized before it's
presented to the optimizer. Databases tend to have clusters of similar
data. If the optimizer starts in such a cluster, it will optimize poorly.)

Our query is simple (a flat AND) compared to what Postgres has to handle.
Even so, a dynamic optimizer is the only effective solution.

Static planners simply can't handle the "early abort" condition, even with
good statistics. Many have pointed out that data are "lumpy" rather than
well distributed. A more subtle problem is that you can have evenly
distributed data, but badly distributed correlations. "Agnes" and "Bob" may
be names that are distributed well in a real-estate database, but it might
happen that all of the information about homes whose owners' names are
"Agnes" and "Bob" occurs at the very end of all of your data because they
just got married and bought a house.

The end result is that even with perfect statistics on each column, you're
still screwed. The combinatorial explosion of possible correlations between
indexes is intractable.

Craig

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2014-10-10 14:41:39 Re: Wait free LW_SHARED acquisition - v0.9
Previous Message Andres Freund 2014-10-10 14:11:27 Re: [9.4 bug] The database server hangs with write-heavy workload on Windows

Browse pgsql-performance by date

  From Date Subject
Next Message Tomas Vondra 2014-10-10 16:53:51 Re: Yet another abort-early plan disaster on 9.3
Previous Message Tomas Vondra 2014-10-10 12:10:13 Re: Yet another abort-early plan disaster on 9.3