Re: Collect frequency statistics for arrays

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-03-04 18:24:40
Message-ID: 16932.1330885480@sss.pgh.pa.us
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Sun, Mar 4, 2012 at 5:38 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 2. The tests in the above-mentioned message show that in most cases
>> where mcelem_array_contained_selec falls through to the "rough
>> estimate", the resulting rowcount estimate is just 1, ie we are coming
>> out with very small selectivities. Although that path will now only be
>> taken when there are no stats, it seems like we'd be better off to
>> return DEFAULT_CONTAIN_SEL instead of what it's doing. I think there
>> must be something wrong with the "rough estimate" logic. Could you
>> recheck that?

> I think the wrong thing with "rough estimate" is that the assumption about
> independent occurrences of items is very unsuitable even for a "rough
> estimate". The following example shows that "rough estimate" really works
> in the case of independent occurrences of items. ...
> In this particular case "rough estimate" is quite accurate. But in most
> cases it behaves really badly. That is why I started to invent
> calc_distr etc. So, I think returning DEFAULT_CONTAIN_SEL is OK unless
> we have some better ideas.

OK. Looking again at that code, I notice that it also punts and returns
DEFAULT_CONTAIN_SEL if it's not given MCELEM stats, which it more or
less has to because without even a minfreq the whole calculation is just
hot air. And there are no plausible scenarios where compute_array_stats
would produce an MCELEM slot but no count histogram. So that says there
is no point in sweating over this case, unless you have an idea how to
produce useful results without MCELEM.

So I think it's sufficient to punt at the top of the function if no
histogram, and take out the various attempts to cope with the case.
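
A minimal sketch of that "punt at the top" shape, assuming a simplified
stand-in for the statistics slots. The struct ArrayStatsSketch, the helper
estimate_with_full_stats, and the function contained_selec_sketch are
hypothetical names invented for illustration, not the actual patch code,
and the value given to DEFAULT_CONTAIN_SEL here is an assumption of the
sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Fallback selectivity; the value is an assumption made for this sketch. */
#define DEFAULT_CONTAIN_SEL 0.005

/* Hypothetical, simplified view of the per-column array statistics. */
typedef struct ArrayStatsSketch
{
    const float *mcelem_freqs;  /* MCELEM frequencies, NULL if slot absent */
    size_t       nmcelem;       /* number of MCELEM frequencies */
    const float *count_hist;    /* count histogram, NULL if slot absent */
    size_t       nhist;         /* number of histogram entries */
} ArrayStatsSketch;

/*
 * Placeholder for the real estimate. It only exists so the sketch runs:
 * it returns the first MCELEM frequency clamped to [0, 1].
 */
static double
estimate_with_full_stats(const ArrayStatsSketch *stats)
{
    double sel = stats->mcelem_freqs[0];

    if (sel < 0.0)
        sel = 0.0;
    if (sel > 1.0)
        sel = 1.0;
    return sel;
}

/*
 * Check once, up front, that both kinds of statistics are present.
 * If either is missing, return the default instead of attempting a
 * partial estimate further down.
 */
double
contained_selec_sketch(const ArrayStatsSketch *stats)
{
    if (stats == NULL ||
        stats->mcelem_freqs == NULL || stats->nmcelem == 0 ||
        stats->count_hist == NULL || stats->nhist == 0)
        return DEFAULT_CONTAIN_SEL;

    /* From here on the code may assume both slots exist. */
    return estimate_with_full_stats(stats);
}
```

With a single check at the top, the rest of the function needs no
special-case branches for a missing histogram.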

regards, tom lane
