Re: Parallel Index Scans

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Subject: Re: Parallel Index Scans
Date: 2017-02-12 01:44:21
Message-ID: CAA4eK1Lyn_7FF6Q1MN9HApQk+w7YptaTX86vxEpTDAGaw4CNvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 10, 2017 at 11:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> The hunk in indexam.c looks like a leftover that can be omitted.
>>
>> It is not a leftover hunk. Earlier, the patch has the same check
>> btparallelrescan, but based on your comment up thread [1] (Why does
>> btparallelrescan cater to the case where scan->parallel_scan== NULL?),
>> this has been moved to indexam.c.
>
> That's not my point. The point is, if it's not a parallel scan,
> index_parallelrescan() should never get called in the first place. So
> therefore it shouldn't need to worry about scan->parallel_scan being
> NULL. It seems possibly reasonable to put something like
> Assert(scan->parallel_scan != NULL) in there, but arranging to return
> without doing anything in that case seems like it just masks possible
> bugs in the calling code. And really I'm not sure we even need the
> Assert.
>

I have removed the check from index_parallelrescan() and ensured in
the caller that it must be called only when parallel_scan descriptor
is present.

Comments from another e-mail:
> This design presupposes that every AM that might ever want to do
> parallel scans is happy with genericcostestimate()'s method of
> computing the number of index pages that will be fetched. However,
> that might not be true for every possible AM. In fact, it's already
> not true for BRIN, which always reads the whole index. Now, since
> we're only concerned with btree at the moment, nothing would be
> immediately broken by this approach but it seems we're just setting
> ourselves up for future pain.
>

I think to make it future-proof, we could add a generic page
estimation function. However, I have tried to implement based on your
below suggestion, see if that looks better to you, otherwise we can
introduce a generic page estimation API.

> I have what I think might be a better idea: let's make
> amcostestimate() responsible for returning a suggested parallel degree
> for the path via an additional out parameter. cost_index() can then
> reduce that value if it seems like not enough heap pages will be
> fetched to justify the return value, or it can override it completely
> if parallel_degree is set for the relation.
>

I see the value of your idea, but I think it might be better to return
the number of IndexPages required to be scanned from amcostestimate()
and then use the already computed value of heap_pages in cost_index to
identify the number of workers. This will make the calculation simple
and avoid overriding the return value. Now, the returned value of
index pages will only be used for cases when partial path is being
constructed, but I think that is okay, because we are not doing any
extra calculation to compute the number of index pages fetched.
Another argument could be that the number of index pages to be used
for parallelism can be different from the number of pages to be
scanned or what ever is used in cost computation, but I think that is
also easy to change later when we create partial paths for indexes
other than btree. I have implemented the above idea in the attached
patch (parallel_index_opt_exec_support_v9.patch)

> Then we don't need to run
> this logic twice to compute the same value, and we don't violate the
> AM abstraction layer.
>

We can avoid computing the same value twice, however, with your
suggested approach, we have to do all the additional work for the
cases where employing parallel workers is not beneficial, so not sure
if there is a net gain.

> BTW, the changes to add_partial_path() aren't needed, because an
> IndexPath only gets reused if you stick a Bitmap{Heap,And,Or}Path on
> top of it, and that won't be the case with this or any other pending
> patch. If we get the ability to have a Parallel Bitmap Heap Scan that
> takes a parallel index scan rather than a standard index scan as
> input, then we'll need something like this. But for now it's probably
> best to just update the comments and remove the Assert().
>

Right, changed as per suggestion.

> I think you can also leave out the documentation changes from these
> patches. I'll do some general rewriting of the parallel query section
> once we know exactly what capabilities we'll have in this release; I
> think that will work better than trying to update them a bit at a time
> for each patch.
>

Okay, removed the documentation part.

Patches to be used: guc_parallel_index_scan_v1.patch [1],
parallel_index_scan_v8.patch, parallel_index_opt_exec_support_v9.patch

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BTnM4pXQbvn7OXqam%2Bk_HZqb0ROZUMxOiL6DWJYCyYow%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
parallel_index_scan_v8.patch application/octet-stream 26.8 KB
parallel_index_opt_exec_support_v9.patch application/octet-stream 37.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-02-12 02:49:42 Re: amcheck (B-Tree integrity checking tool)
Previous Message Tom Lane 2017-02-12 00:30:35 Re: Improve OR conditions on joined columns (common star schema problem)