From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: estimating # of distinct values
Date: 2011-01-20 21:42:45
Message-ID: 4D38AC55.70402@fuzzy.cz
Lists: pgsql-hackers
On 20.1.2011 03:36, Robert Haas wrote:
> On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>>> Regarding the crash scenario - if the commit fails, just throw away the
>>>> local estimator copy, it's not needed. I'm not sure how to take care of
>>>> the case when commit succeeds and the write of the merged estimator
>>>> fails, but I think it might be possible to write the estimator to xlog
>>>> or something like that. So it would be replayed during recovery etc. Or
>>>> is it a stupid idea?
>>>
>>> It's not stupid, in the sense that that is what you'd need to do if
>>> you want to avoid ever having to rescan the table. But it is another
>>> thing that I think is going to be way too expensive.
>>
>> Way too expensive? All you need to put into the logfile is a copy of the
>> estimator, which is a few kBs. How is that 'way too expensive'?
>
> At this point, this is all a matter of religion, right? Neither of us
> has a working implementation we've benchmarked. But yes, I believe
> you're going to find that implementing some kind of streaming
> estimator is going to impose a... <pulls number out of rear end> 6%
> performance penalty, even after you've optimized the living daylights
> out of it. Now you might say... big deal, it improves my problem
> queries by 100x. OK, but if you could get the same benefit by doing
> an occasional full table scan during off hours, you could have the
> same performance with a *0%* performance penalty. Even better, the
> code changes would be confined to ANALYZE rather than spread out all
> over the system, which has positive implications for robustness and
> likelihood of commit.
Good point. What I was trying to do was to continuously update the
estimator with new data - that was the whole idea behind collecting the
new values (which might lead to the memory problems you've pointed out)
and updating a local copy of the estimator (which I still think is a
good idea).
But this might be another option - let the user decide whether to update
the estimates continuously (and pay the price) or only during off hours
(and pay almost nothing). That sounds like a very good solution to me.
> I'm not trying to argue you out of working on this. It's obviously
> your time to spend, and if it works better than I think it will, great!
> I'm merely offering you an opinion on what will probably happen if you
> go this route - namely, it'll carry an unpalatable run-time penalty.
> That opinion may be worth no more than what you paid for it, but there
> you have it.
Yes, and I appreciate all the feedback. But I still believe this can be
done so that users who don't need the feature don't pay for it.
regards
Tomas