Re: Parallel Index Scans

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Subject: Re: Parallel Index Scans
Date: 2017-01-31 22:22:49
Message-ID: CA+TgmobUR9yE1k3V8qkkM_2CNrFZi74QOUBm1cHC47wDdr6CGw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> In spite of being careful, I missed reorganizing the functions in
>> genam.h which I have done in attached patch.
>
> Cool. Committed parallel-generic-index-scan.2.patch. Thanks.

Reviewing parallel_index_scan_v6.patch:

I think it might be better to swap the return value and the out
parameter for _bt_parallel_seize; that is, return a bool, and have
callers ignore the value of the out parameter (e.g. *pageno).

I think _bt_parallel_advance_scan should be renamed something that
includes the word "keys", like _bt_parallel_advance_array_keys.

The hunk in indexam.c looks like a leftover that can be omitted.

+/*
+ * Below flags are used to indicate the state of parallel scan.

They aren't really flags any more; they're members of an enum. I
think you could just leave this sentence out entirely and start right
in with descriptions of the individual values. But maybe all of those
descriptions should end in a period (instead of one ending in a period
but not the other three) since they read like complete sentences.

+ * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan

Initializing -> initialize

+ * btparallelrescan() -- reset parallel scan

Previous two prototypes have one dash, this one has two. Make it
consistent, please.

+ * Ideally, we don't need to acquire spinlock here, but being
+ * consistent with heap_rescan seems to be a good idea.

How about: In theory, we don't need to acquire the spinlock here,
because there shouldn't be any other workers running at this point,
but we do so for consistency.

+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ * scans and latest block scanned for backward scans.

I think the description should be more like "Begin the process of
advancing the scan to a new page. Other scans must wait until we call
bt_parallel_release() or bt_parallel_done()." And likewise
_bt_parallel_release() should say something like "Complete the process
of advancing the scan to a new page. We now have the new value for
btps_scanPage; some other backend can now begin advancing the scan."
And _bt_parallel_done should say something like "Mark the parallel
scan as complete."

I am a bit mystified about how this manages to work with array keys.
_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
is already BTPARALLEL_DONE. It seems like one of those two things has
to be wrong.

_bt_readpage's header comment should be updated to note that, in the
case of a parallel scan, _bt_parallel_seize should have been called
prior to entering this function, and _bt_parallel_release will be
called prior to return (or this could alternatively be phrased in
terms of btps_pageStatus on entry/exit).

_bt_readnextpage isn't very clear about the meaning of its blkno
argument. It looks like it always has to be valid when scanning
forward, but only in the parallel case when scanning backward? That
is a little odd. Another, possibly related thing that is odd is that
when _bt_steppage() finds no tuples and decides to advance to a new
page again, there's a very short comment in the forward case and a
very long comment in the backward case:

/* nope, keep going */
vs.
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page, other
* worker has also decided to scan that previous page. We could
* avoid that by doing _bt_parallel_release once we have read the
* current page, but it is bad to make other workers wait till we
* read the page.
*/

Now it seems to me that these cases are symmetric and the issues are
identical. It's basically that, while we were judging whether the
current page has useful contents, some other process could have
advanced the scan (since _bt_readpage un-seized it).

- /* check for deleted page */
page = BufferGetPage(so->currPos.buf);
TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* check for deleted page */

This is an independent change; committed.

What kind of testing has been done to ensure that this doesn't have
concurrency bugs? What's the highest degree of parallelism that's
been tested? Did that test include array keys?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-01-31 22:24:17 Re: Patch: Write Amplification Reduction Method (WARM)
Previous Message Tom Lane 2017-01-31 22:21:28 Re: Patch: Write Amplification Reduction Method (WARM)