Re: Parallel Index Scans

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Subject: Re: Parallel Index Scans
Date: 2017-02-01 13:20:29
Message-ID: CAA4eK1JYBFRqXDwetc6ik0NTEUY+R=VPU7Ydoo45JcAdvXPW5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 1, 2017 at 3:52 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> In spite of being careful, I missed reorganizing the functions in
>>> genam.h which I have done in attached patch.
>>
>> Cool. Committed parallel-generic-index-scan.2.patch. Thanks.
>
> Reviewing parallel_index_scan_v6.patch:
>
> I think it might be better to swap the return value and the out
> parameter for _bt_parallel_seize; that is, return a bool, and have
> callers ignore the value of the out parameter (e.g. *pageno).
>

makes sense, so changed accordingly.

> I think _bt_parallel_advance_scan should be renamed something that
> includes the word "keys", like _bt_parallel_advance_array_keys.
>

Changed as per suggestion.

> The hunk in indexam.c looks like a leftover that can be omitted.
>

It is not a leftover hunk. Earlier, the patch has the same check
btparallelrescan, but based on your comment up thread [1] (Why does
btparallelrescan cater to the case where scan->parallel_scan== NULL?),
this has been moved to indexam.c.

> +/*
> + * Below flags are used to indicate the state of parallel scan.
>
> They aren't really flags any more; they're members of an enum. I
> think you could just leave this sentence out entirely and start right
> in with descriptions of the individual values. But maybe all of those
> descriptions should end in a period (instead of one ending in a period
> but not the other three) since they read like complete sentences.
>
> + * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan
>
> Initializing -> initialize
>
> + * btparallelrescan() -- reset parallel scan
>
> Previous two prototypes have one dash, this one has two. Make it
> consistent, please.
>
> + * Ideally, we don't need to acquire spinlock here, but being
> + * consistent with heap_rescan seems to be a good idea.
>
> How about: In theory, we don't need to acquire the spinlock here,
> because there shouldn't be any other workers running at this point,
> but we do so for consistency.
>
> + * _bt_parallel_seize() -- returns the next block to be scanned for forward
> + * scans and latest block scanned for backward scans.
>
> I think the description should be more like "Begin the process of
> advancing the scan to a new page. Other scans must wait until we call
> bt_parallel_release() or bt_parallel_done()." And likewise
> _bt_parallel_release() should say something like "Complete the process
> of advancing the scan to a new page. We now have the new value for
> btps_scanPage; some other backend can now begin advancing the scan."
> And _bt_parallel_done should say something like "Mark the parallel
> scan as complete."
>

Changed the code as per your above suggestions.

> I am a bit mystified about how this manages to work with array keys.
> _bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
> unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
> _bt_parallel_advance_scan() won't do anything unless btps_pageStatus
> is already BTPARALLEL_DONE.
>

This is just to ensure that btps_arrayKeyCount is advanced once and
btps_pageStatus is changed to BTPARALLEL_DONE once per array element.
So it goes something like, if we have array with values [1,2,3], then
all the workers will complete the scan with key 1 and one of them will
mark btps_pageStatus as BTPARALLEL_DONE and then first one to hit
_bt_parallel_advance_scan will increment the value of
btps_arrayKeyCount, then same will happen for key 2 and 3. It is
quite possible that by the time one of the participant advances it
local key, the scan for that key is already finished and we handle
that situation in _bt_parallel_seize() with below check:

if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
*status = false;

I think Rahila has also mentioned something on above lines, let us
know if we are missing something here? Do you want to add more
comments in the code to explain this handling, if yes, then where (on
top of function _bt_parallel_advance_scan)?

> It seems like one of those two things has
> to be wrong.
>
> _bt_readpage's header comment should be updated to note that, in the
> case of a parallel scan, _bt_parallel_seize should have been called
> prior to entering this function, and _bt_parallel_release will be
> called prior to return (or this could alternatively be phrased in
> terms of btps_pageStatus on entry/exit).
>

Changed as per suggestion.

> _bt_readnextpage isn't very clear about the meaning of its blkno
> argument. It looks like it always has to be valid when scanning
> forward, but only in the parallel case when scanning backward?
>

It can be only invalid for non-parallel backward scan and in that case
the appropriate value for so->currPos will be set. Refer
_bt_steppage(). I have updated the comments.

> That
> is a little odd. Another, possibly related thing that is odd is that
> when _bt_steppage() finds no tuples and decides to advance to a new
> page again, there's a very short comment in the forward case and a
> very long comment in the backward case:
>
> /* nope, keep going */
> vs.
> /*
> * For parallel scans, get the last page scanned as it is quite
> * possible that by the time we try to fetch previous page, other
> * worker has also decided to scan that previous page. We could
> * avoid that by doing _bt_parallel_release once we have read the
> * current page, but it is bad to make other workers wait till we
> * read the page.
> */
>
> Now it seems to me that these cases are symmetric and the issues are
> identical. It's basically that, while we were judging whether the
> current page has useful contents, some other process could have
> advanced the scan (since _bt_readpage un-seized it).
>

Yeah, but the reason of difference in comments is that for
non-parallel backwards scans there is no code at that place to move to
previous page and it basically relies on next call to _bt_walk_left()
whereas for parallel-scans, we can't simply rely on _bt_walk_left().
I have slightly modified the comments for backward scan case, see if
that looks better and if not, then suggest what you think is better.

> - /* check for deleted page */
> page = BufferGetPage(so->currPos.buf);
> TestForOldSnapshot(scan->xs_snapshot, rel, page);
> opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> + /* check for deleted page */
>
> This is an independent change; committed.
>

Thanks.

> What kind of testing has been done to ensure that this doesn't have
> concurrency bugs?
>

Used large table parallel index scans (both forward and backward
scans). These tests have been done by Tushar and you can find
detailed report up thread [2]. Apart from that, the patch has been
tested with TPC-H queries at various scale factors and it is being
used in multiple queries and we have verified the results of same as
well. TPC-H tests have been done by Rafia.

Tushar has done some further extensive test of this patch. Tushar,
can you please share your test results?

> What's the highest degree of parallelism that's
> been tested?

7

> Did that test include array keys?
>

Yes.

Note - The order in which patches needs to be applied:
compute_index_pages_v1.patch, parallel_index_scan_v7.patch,
parallel_index_opt_exec_support_v7.patch. The first and third patches
are posted up thread [3].

[1] - https://www.postgresql.org/message-id/CA%2BTgmoZv0%2BcLUV7fZRo76_PB9cfu5mBCVmoXKmaqrc7F30nJzw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/1d6353a0-63cb-65d9-a70c-0913899d5b06%40enterprisedb.com
[3] - https://www.postgresql.org/message-id/CAA4eK1KowGSYYVpd2qPpaPPA5R90r%2B%2BQwDFbrRECTE9H_HvpOg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
parallel_index_scan_v7.patch application/octet-stream 27.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2017-02-01 13:25:20 Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)
Previous Message Corey Huinker 2017-02-01 13:18:24 Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)