From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2011-01-20 21:42:45
Message-ID: 4D38AC55.70402@fuzzy.cz
Lists: pgsql-hackers

On 20.1.2011 03:36, Robert Haas wrote:
> On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>>> Regarding the crash scenario - if the commit fails, just throw away the
>>>> local estimator copy, it's not needed. I'm not sure how to take care of
>>>> the case when commit succeeds and the write of the merged estimator
>>>> fails, but I think it might be possible to write the estimator to xlog
>>>> or something like that. So it would be replayed during recovery etc. Or
>>>> is it a stupid idea?
>>>
>>> It's not stupid, in the sense that that is what you'd need to do if
>>> you want to avoid ever having to rescan the table. But it is another
>>> thing that I think is going to be way too expensive.
>>
>> Way too expensive? All you need to put into the logfile is a copy of the
>> estimator, which is a few kBs. How is that 'way too expensive'?
>
> At this point, this is all a matter of religion, right? Neither of us
> has a working implementation we've benchmarked. But yes, I believe
> you're going to find that implementing some kind of streaming
> estimator is going to impose a... <pulls number out of rear end> 6%
> performance penalty, even after you've optimized the living daylights
> out of it. Now you might say... big deal, it improves my problem
> queries by 100x. OK, but if you could get the same benefit by doing
> an occasional full table scan during off hours, you could have the
> same performance with a *0%* performance penalty. Even better, the
> code changes would be confined to ANALYZE rather than spread out all
> over the system, which has positive implications for robustness and
> likelihood of commit.

Good point. What I was trying to do was to continuously update the
estimator with new data - that was the whole idea behind collecting the
new values (which might lead to memory problems etc., as you've pointed
out) and updating a local copy of the estimator (which I still think is
a good idea).
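
Just to illustrate what I mean by "merging a local copy", here is a
minimal sketch - a HyperLogLog-style counter, not the actual estimator
from the patch, and all the names are made up. The point is that merging
two copies is just an element-wise max over the registers, so each
backend can update a private copy and fold it into the shared state at
commit:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#define HLL_BITS 10                     /* 2^10 = 1024 registers */
#define HLL_REGS (1 << HLL_BITS)

typedef struct { uint8_t reg[HLL_REGS]; } Estimator;

/* splitmix64 finalizer, standing in for a real hash function */
static uint64_t hash64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

static void estimator_add(Estimator *e, uint64_t value)
{
    uint64_t h = hash64(value);
    uint32_t idx = (uint32_t) (h >> (64 - HLL_BITS));  /* register index */
    uint64_t rest = h << HLL_BITS;      /* remaining bits give the rank */
    uint8_t rank = 1;

    while (rank <= 64 - HLL_BITS && !(rest & (1ULL << 63)))
    {
        rest <<= 1;
        rank++;
    }
    if (rank > e->reg[idx])
        e->reg[idx] = rank;
}

/* merging two copies is an element-wise max, so a per-backend local
 * copy can be folded into the shared estimator at commit time */
static void estimator_merge(Estimator *dst, const Estimator *src)
{
    for (int i = 0; i < HLL_REGS; i++)
        if (src->reg[i] > dst->reg[i])
            dst->reg[i] = src->reg[i];
}

static double estimator_count(const Estimator *e)
{
    double sum = 0.0;
    double alpha = 0.7213 / (1.0 + 1.079 / HLL_REGS);

    for (int i = 0; i < HLL_REGS; i++)
        sum += ldexp(1.0, -e->reg[i]);  /* 2^(-reg[i]) */

    /* raw estimate; small-range correction omitted for brevity */
    return alpha * HLL_REGS * HLL_REGS / sum;
}

int main(void)
{
    Estimator shared = {{0}}, local = {{0}};
    uint64_t i;

    for (i = 0; i < 50000; i++)
        estimator_add(&shared, i);
    for (i = 25000; i < 75000; i++)
        estimator_add(&local, i);       /* values seen by one transaction */
    estimator_merge(&shared, &local);   /* the commit-time merge */
    printf("estimated ndistinct: %.0f (exact: 75000)\n",
           estimator_count(&shared));
    return 0;
}

This is also why the xlog idea isn't crazy - the whole state is just the
register array, a few kB that could be written out after the merge.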

But this might be another option - let the user decide whether he wants
to continuously update the estimates (and pay the price) or do that off
hours (and pay almost nothing). That sounds like a very good solution to me.
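
The switch itself could be a simple per-table mode checked at commit
time. Continuing the sketch above (again, the names are hypothetical,
none of this exists in the patch):

/* In 'streaming' mode the local copy is merged at commit (and would be
 * WAL-logged for crash safety); in 'analyze' mode it is simply thrown
 * away and a scheduled off-hours ANALYZE rebuilds the statistics. */
typedef enum { NDISTINCT_ANALYZE, NDISTINCT_STREAMING } NdistinctMode;

static void
estimator_at_commit(Estimator *shared, Estimator *local, NdistinctMode mode)
{
    if (mode == NDISTINCT_STREAMING)
        estimator_merge(shared, local); /* continuous updates, pay the price */

    /* either way, reset the local copy for the next transaction */
    memset(local, 0, sizeof(*local));
}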

> I'm not trying to argue you out of working on this. It's obviously
> your time to spend, and if works better than I think it will, great!
> I'm merely offering you an opinion on what will probably happen if you
> go this route - namely, it'll carry an unpalatable run-time penalty.
> That opinion may be worth no more than what you paid for it, but there
> you have it.

Yes, and I appreciate all the feedback. But I still believe this can be
done so that users who don't need the feature don't pay for it.

regards
Tomas
