Re: bitmaps and correlation

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bitmaps and correlation
Date: 2020-01-07 05:26:06
Message-ID: 20200107052606.GP12066@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 07, 2020 at 09:21:03AM +0530, Dilip Kumar wrote:
> On Tue, Jan 7, 2020 at 1:29 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> >
> > Find attached cleaned up patch.
> > For now, I updated the regress/expected/, but I think the test maybe has to be
> > updated to do what it was written to do.
>
> I have noticed that in "cost_index" we have used the indexCorrelation
> for computing the run_cost, not the number of pages whereas in your
> patch you have used it for computing the number of pages. Any reason
> for the same?

As Jeff has pointed out, high correlation has two effects in cost_index():
1) the number of pages read will be less;
2) the pages will be read more sequentially;

cost_index reuses the pages_fetched variable, so (1) isn't particularly clear,
but does essentially:

/* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
pages_fetched(MIN) = index_pages_fetched(tuples_fetched,
baserel->pages,
(double) index->pages,
root);
max_IO_cost = pages_fetchedMIN * spc_random_page_cost;

/* min_IO_cost is for the perfectly correlated case (csquared=1) */
pages_fetched(MAX) = ceil(indexSelectivity * (double) baserel->pages);
min_IO_cost = (pages_fetchedMAX - 1) * spc_seq_page_cost;

My patch 1) changes compute_bitmap_pages() to interpolate pages_fetched using
the correlation; pages_fetchedMIN is new:

> Patch
> - pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
> + pages_fetchedMAX = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
> +
> + /* pages_fetchedMIN is for the perfectly correlated case (csquared=1) */
> + pages_fetchedMIN = ceil(indexSelectivity * (double) baserel->pages);
> +
> + pages_fetched = pages_fetchedMAX + indexCorrelation*indexCorrelation*(pages_fetchedMIN - pages_fetchedMAX);

And, 2) also computes cost_per_page by interpolation between seq_page and
random_page cost:

+ cost_per_page_corr = spc_random_page_cost -
+ (spc_random_page_cost - spc_seq_page_cost)
+ * (1-correlation*correlation);

Thanks for looking. I'll update the name of pages_fetchedMIN/MAX in my patch
for consistency with cost_index.

Justin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nagaraj Raj 2020-01-07 06:15:09 pg_repack failure
Previous Message Amit Langote 2020-01-07 05:01:49 Re: adding partitioned tables to publications