Re: Parallel bitmap heap scan

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2017-02-16 15:54:06
Message-ID: CAFiTN-uqMdXQdKDkD94WVYWoPnRCmA+sy_23--iq2ZPemniz6Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> What is the point of having a TBMSharedIterator contain a TIDBitmap
> pointer? All the values in that TIDBitmap are populated from the
> TBMSharedInfo, but it seems to me that the fields that are copied over
> unchanged (like status and nchunks) could just be used directly from
> the TBMSharedInfo, and the fields that are computed (like relpages and
> relchunks) could be stored directly in the TBMSharedIterator.
> tbm_shared_iterate() is already separate code from tbm_iterate(), so
> it can be changed to refer to whichever data structure contains the
> data it needs.

Done

>
> Why is TBMSharedInfo separate from TBMSharedIteratorState? It seems
> like you could merge those two into a single structure.

Done
>
> I think we can simplify things here a bit by having shared iterators
> not support single-page mode. In the backend-private case,
> tbm_begin_iterate() really doesn't need to do anything with the
> pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
> got to anyway copy the pagetable into shared memory. So it seems to
> me that it would be simpler just to transform it into a standard
> iteration array while we're at it, instead of copying it into entry1.
> In other words, I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

Done

>
> I think "dsa_data" is poorly named; it should be something like
> "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo. I
> think you should move the "base", "relpages", and "relchunks"
> into TBMSharedIterator and give them better names, like "ptbase",
> "ptpages" and "ptchunks". "base" isn't clear that we're talking about
> the pagetable's base as opposed to anything else, and "relpages" and
> "relchunks" are named based on the fact that the pointers are relative
> rather than named based on what data they point at, which doesn't seem
> like the right choice.

Done
>
> I suggest putting the parallel functions next to each other in the
> file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
> tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
> tbm_end_shared_iterate().

Done
>
> + if (tbm->dsa == NULL)
> + return pfree(pointer);
>
> Don't try to return a value of type void. The correct spelling of
> this is { pfree(pointer); return; }. Formatted appropriately across 4
> lines, of course.

Fixed

>
> + /*
> + * If TBM is in iterating phase that means pagetable is already created
> + * and we have come here during tbm_free. By this time we are already
> + * detached from the DSA because the GatherNode would have been shutdown.
> + */
> + if (tbm->iterating)
> + return;
>
> This seems like something of a kludge, although not a real easy one to
> fix. The problem here is that tidbitmap.c ideally shouldn't have to
> know that it's being used by the executor or that there's a Gather
> node involved, and certainly not the order of operations around
> shutdown. It should really be the problem of 0002 to handle this kind
> of problem, without 0001 having to know anything about it. It strikes
> me that it would probably be a good idea to have Gather clean up its
> children before it gets cleaned up itself. So in ExecShutdownNode, we
> could do something like this:
>
> diff --git a/src/backend/executor/execProcnode.c
> b/src/backend/executor/execProcnode.c
> index 0dd95c6..5ccc2e8 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
> if (node == NULL)
> return false;
>
> + planstate_tree_walker(node, ExecShutdownNode, NULL);
> +
> switch (nodeTag(node))
> {
> case T_GatherState:
> @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
> break;
> }
>
> - return planstate_tree_walker(node, ExecShutdownNode, NULL);
> + return false;
> }
>
> Also, in ExecEndGather, something like this:
>
> diff --git a/src/backend/executor/nodeGather.c
> b/src/backend/executor/nodeGather.c
> index a1a3561..32c97d3 100644
> --- a/src/backend/executor/nodeGather.c
> +++ b/src/backend/executor/nodeGather.c
> @@ -229,10 +229,10 @@ ExecGather(GatherState *node)
> void
> ExecEndGather(GatherState *node)
> {
> + ExecEndNode(outerPlanState(node)); /* let children clean up first */
> ExecShutdownGather(node);
> ExecFreeExprContext(&node->ps);
> ExecClearTuple(node->ps.ps_ResultTupleSlot);
> - ExecEndNode(outerPlanState(node));
> }
>
> /*
>
> With those changes and an ExecShutdownBitmapHeapScan() called from
> ExecShutdownNode(), it should be possible (I think) for us to always
> have the bitmap heap scan node shut down before the Gather node shuts
> down, which I think would let you avoid having a special case for this
> inside the TBM code.

Done
(gather_shutdown_children_first.patch does what you mentioned
above, and 0002 implements the BitmapHeapScanShutdown function).

>
> + char *ptr;
> + dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer));
> + tbm->dsa_data = dsaptr;
> + ptr = dsa_get_address(tbm->dsa, dsaptr);
> + memset(ptr, 0, size + sizeof(dsa_pointer));
> + *((dsa_pointer *) ptr) = dsaptr;
>
> Hmm. Apparently, we need a dsa_allocate_and_zero() or dsa_allocate0()
> function. This pattern seems likely to come up a lot, and I don't
> think we should require every caller to deal with it.

Done

>
> I don't see why you think you need to add sizeof(dsa_pointer) to the
> allocation and store an extra copy of the dsa_pointer in the
> additional space. You are already storing it in tbm->dsa_data and

Done with the help of two pointers as discussed in another mail [1].

[1] https://www.postgresql.org/message-id/CA%2BTgmoaWsCJ5L2du9i1RQaegaNLgOYF2KgRFu%2B7sUAeQc_xBFg%40mail.gmail.com

>
> It seems shaky to me that tbm->iterating can be set when we've created
> either a shared or a backend-private iterator. For example,
> tbm_iterate() asserts that tbm->iterating is set as a way of checking
> that spages and schunks will be set. But that's not guaranteed any
> more with these changes, because we might've built the shared version
> of the iteration arrays rather than the backend-private version.
> Maybe you ought to change it from a bool to an enum:
> TBM_NOT_ITERATING, TBM_ITERATING_PRIVATE, TBM_ITERATING_SHARED. And
> then update all of the asserts and tests to check for the specific
> state they care about.

Done
>
> + while (schunkbit < PAGES_PER_CHUNK)
> + {
> + int wordnum = WORDNUM(schunkbit);
> + int bitnum = BITNUM(schunkbit);
> +
> + if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
> + break;
> + schunkbit++;
> + }
>
> How about refactoring this chunk of code into a static inline function
> and having both tbm_iterate() and tbm_shared_iterate() call it?
> Probably something like static inline void
> tbm_advance_schunkbit(PageTableEntry *chunk, int *schunkbit).

Good idea, done this way
>
> + /* scan bitmap to extract individual offset numbers */
> + ntuples = 0;
> + for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++)
> + {
> + bitmapword w = page->words[wordnum];
> +
> + if (w != 0)
> + {
> + int off = wordnum * BITS_PER_BITMAPWORD + 1;
> +
> + while (w != 0)
> + {
> + if (w & 1)
> + output->offsets[ntuples++] = (OffsetNumber) off;
> + off++;
> + w >>= 1;
> + }
> + }
> + }
>
> Similarly, this looks like it could be refactored into a shared static
> inline function as well, instead of duplicating it.

Done.

Attached patches:

interface_dsa_allocate0.patch : Provides the new interface dsa_allocate0,
which is used by 0001.
gather_shutdown_children_first.patch : Processes the child nodes
before shutting down the Gather; 0002 uses that part to shut down the
BitmapHeapScan node before the Gather.

0001: tidbitmap changes to provide the interfaces for a shared iterator.
0002: actual parallel bitmap heap scan using the interfaces of 0001.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
interface_dsa_allocate0.patch application/octet-stream 1018 bytes
gather_shutdown_children_first.patch application/octet-stream 1.0 KB
0001-tidbitmap-support-shared-v3.patch application/octet-stream 22.5 KB
0002-parallel-bitmap-heapscan-v3.patch application/octet-stream 37.9 KB
