Skip site navigation (1) Skip section navigation (2)

self-tuning histograms

From: Neil Conway <nconway(at)klamath(dot)dyndns(dot)org>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: self-tuning histograms
Date: 2002-05-30 03:05:18
Message-ID: 20020529230518.28305191.nconway@klamath.dyndns.org (view raw or flat)
Thread:
Lists: pgsql-hackers
What does everyone think about adding self-tuning histograms
to PostgreSQL?

Briefly, a self-tuning histogram is one that is constructed without
looking at the data in the attribute; it uses the information provided
by the query executor to adjust a default set of histograms that are
created when the table is defined. Thus, the histograms automatically
adapt to the data that is stored in the table -- as the data
distribution changes, so do the histograms.

Histogram refinement can take place in two possible ways: online
(as queries are executed, the histograms are updated immediately),
or offline (the necessary data is written to a log after every
query, which is processed on a regular basis to refine the
histograms).

The paper I've looked at on this topic is "Self-tuning
Histograms: Building Histograms Without Looking at Data", by
Aboulnaga and Shaudhuri (1999), which you can find here:
http://citeseer.nj.nec.com/255752.html -- please refer to
it for lots more information on this technique.

I think that ST histograms would be useful because:

(1) It would make it easier for us to implement multi-dimensional
    histograms (for more info, see the Aboulnaga and Shaudhuri).
    Since no commercial system currently implements them, I
    think this would be a neat thing to have.

(2) I'm unsure of the accuracy of building histograms through
    statistical sampling. My guess would be that ST histograms
    would achieve better accuracy when it matters most -- i.e.
    on those tables accessed the most often (since those are
    the tables for which the most histogram refinement is done).

(3) The need for manual DB maintainence through VACUUM and
    ANALYZE is problematic. This technique would be a step in
    the direction of removing that requirement. Self-tuning
    databases are something a lot of industry players (IBM,
    Microsoft, others) are working toward.

(4) They scale well -- refining histograms on a 100 million
    tuple table is no different than on a 100 tuple table.

There are some disadvantages, however:

(1) Reproduceability: At the moment, the system's performance
    only changes when the data is changed, or the DBA makes a
    configuration change. With this (and other "self-tuning"
    techniques, which are becoming very popular among
    commercial databases), the system can change the state of
    the system without the intervention of the DBA. While I'd
    hope that those changes are for the better (i.e. histograms
    eventually converging toward "perfect" accuracy), that
    won't always be the case. I don't really see a way around
    this, other than letting the DBA disable ST histograms
    when debugging problems.

(2) Performance: As Aboulnaga and Shaudhuri point out, online
    histogram refinement can become a point of contention.
    Obviously, we want to avoid that. I think online refinement
    is still possible as long as we:

        (a) don't block waiting for locks: try to acquire the
            necessary locks to refine the histograms,
            immediately give up if not possible

        (b) delay histogram refinement so it doesn't interfere
            with the user: for example, store histogram data
            locally and only update the system catalogs when
            the backend is idle

        (c) only update the histogram when major changes can
            be applied: skip trivial refinements (or store those
            in the offline log for later processing)

        (d) allow the DBA to choose between offline and online
            histogram refinement (assuming we choose to implement
            both)

Any comments?

Cheers,

Neil

-- 
Neil Conway <neilconway(at)rogers(dot)com>
PGP Key ID: DB3C29FC

Responses

pgsql-hackers by date

Next:From: Gavin SherryDate: 2002-05-30 03:52:08
Subject: Re: self-tuning histograms
Previous:From: Kenneth ChanDate: 2002-05-30 01:40:15
Subject: Re: Polygons passed to poly_overlap have 0 pts when

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group