Re: MCV lists for highly skewed distributions

From: John Naylor <jcnaylor(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2017-12-29 14:05:28
Message-ID: CAJVSVGWm9QgkbpgbZ74GCLD=UdSrYo4aq+tBdAU9foR-N+1NNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/28/17, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> I want to revive a patch I sent couple years ago to the performance list,
> as I have seen the issue pop up repeatedly since then.

> If we stored just a few more values, their inclusion in the MCV would mean
> they are depleted from the residual count, correctly lowering the estimate
> we would get for very rare values not included in the sample.
>
> So instead of having the threshold of 1.25x the average frequency over all
> values, I think we should use 1.25x the average frequency of only those
> values not already included in the MCV, as in the attached.

After reading the thread you mentioned, I think that's a good approach.

On master, the numerator of avg_count is nonnull_cnt, but you've
(indirectly) set it to values_cnt. You didn't mention it here, but I
think you're right and master is wrong, since in this function only
sortable values go through the MCV path. If I understand the code
correctly, if there are enough too-wide values in the sample, they
could bias the average and prevent normal values from being considered
for the MCV list. Am I off base here?

Anyway, since you're now overwriting ndistinct_table, I would rename
it to ndistinct_remaining for clarity's sake.

This part doesn't use any loop variables, so should happen before the loop:

+ if (num_mcv > track_cnt)
+ num_mcv = track_cnt;

I think this comment
/* estimate # of occurrences in sample of a typical nonnull value */
belongs just above the calculation of avg_count.

> I think that perhaps maxmincount should also use the dynamic
> values_cnt_remaining rather than the static one. After all, things
> included in the MCV don' get represented in the histogram. When I've seen
> planning problems due to skewed distributions I also usually see redundant
> values in the histogram boundary list which I think should be in the MCV
> list instead. But I have not changed that here, pending discussion.

I think this is also a good idea, but I haven't thought it through. If
you don't go this route, I would move this section back out of the
loop as well.

-John Naylor

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Steele 2017-12-29 14:11:12 Re: Basebackups reported as idle
Previous Message Vladimir Svedov 2017-12-29 13:57:33 array_ndims never returns zero