Re: estimating # of distinct values

From: Jim Nasby <jim(at)nasby(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: estimating # of distinct values
Date: 2011-01-01 03:24:26
Message-ID: AD87588E-8BCE-403C-B8F7-871C9182157F@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Dec 31, 2010, at 7:34 AM, Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010:
>> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>>> I was thinking that we could have two different ANALYZE modes, one
>>> "full" and one "incremental"; autovacuum could be modified to use one or
>>> the other depending on how many changes there are (of course, the user
>>> could request one or the other, too; not sure what should be the default
>>> behavior).
>>
>> How is an incremental ANALYZE going to work at all? It has no way to
>> find out the recent changes in the table, for *either* inserts or
>> deletes. Unless you want to seqscan the whole table looking for tuples
>> with xmin later than something-or-other ... which more or less defeats
>> the purpose.
>
> Yeah, I was thinking that this incremental ANALYZE would be the stream
> in the "stream-based estimator" but evidently it doesn't work that way.
> The stream that needs to be passed to the estimator consists of new
> tuples as they are being inserted into the table, so this would need to
> be done by the inserter process ... or it'd need to transmit the CTIDs
> for someone else to stream them ... not an easy thing, in itself.

Perhaps listen/notify could be used for this, now that it allows passing a payload.

BTW, if we reduce the frequency at which full scans of large tables are needed then presumably the cost of the scans could be largely ignored. If we don't need to scan frequently then we shouldn't care very much about how long a scan takes, which means we could throttle it heavily. Presumably even a heavily used system can spare 500kB/s of IO to perform background scanning...
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-01-01 05:05:30 Re: and it's not a bunny rabbit, either
Previous Message Josh Berkus 2011-01-01 00:20:54 Re: Sync Rep Design