Re: Parallel bitmap heap scan

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2017-02-14 16:48:18
Message-ID: CA+TgmoY6bOc1YnhcAQnMfCBDbsJzROQ3sYxSAL-SYB5tMJcTKg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 14, 2017 at 12:18 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> Fixed.

Thanks, the external interface to this looks much cleaner now.
Scrutinizing the internals:

What is the point of having a TBMSharedIterator contain a TIDBitmap
pointer? All the values in that TIDBitmap are populated from the
TBMSharedInfo, but it seems to me that the fields that are copied over
unchanged (like status and nchunks) could just be used directly from
the TBMSharedInfo, and the fields that are computed (like relpages and
relchunks) could be stored directly in the TBMSharedIterator.
tbm_shared_iterate() is already separate code from tbm_iterate(), so
it can be changed to refer to whichever data structure contains the
data it needs.

Why is TBMSharedInfo separate from TBMSharedIteratorState? It seems
like you could merge those two into a single structure.

I think we can simplify things here a bit by having shared iterators
not support single-page mode. In the backend-private case,
tbm_begin_iterate() really doesn't need to do anything with the
pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
got to anyway copy the pagetable into shared memory. So it seems to
me that it would be simpler just to transform it into a standard
iteration array while we're at it, instead of copying it into entry1.
In other words, I suggest removing both "entry1" and "status" from
TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
compensate.

I think "dsa_data" is poorly named; it should be something like
"dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo. I
think you should should move the "base", "relpages", and "relchunks"
into TBMSharedIterator and give them better names, like "ptbase",
"ptpages" and "ptchunks". "base" isn't clear that we're talking about
the pagetable's base as opposed to anything else, and "relpages" and
"relchunks" are named based on the fact that the pointers are relative
rather than named based on what data they point at, which doesn't seem
like the right choice.

I suggest putting the parallel functions next to each other in the
file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
tbm_end_shared_iterate().

+ if (tbm->dsa == NULL)
+ return pfree(pointer);

Don't try to return a value of type void. The correct spelling of
this is { pfree(pointer); return; }. Formatted appropriately across 4
lines, of course.

+ /*
+ * If TBM is in iterating phase that means pagetable is already created
+ * and we have come here during tbm_free. By this time we are already
+ * detached from the DSA because the GatherNode would have been shutdown.
+ */
+ if (tbm->iterating)
+ return;

This seems like something of a kludge, although not a real easy one to
fix. The problem here is that tidbitmap.c ideally shouldn't have to
know that it's being used by the executor or that there's a Gather
node involved, and certainly not the order of operations around
shutdown. It should really be the problem of 0002 to handle this kind
of problem, without 0001 having to know anything about it. It strikes
me that it would probably be a good idea to have Gather clean up its
children before it gets cleaned up itself. So in ExecShutdownNode, we
could do something like this:

diff --git a/src/backend/executor/execProcnode.c
b/src/backend/executor/execProcnode.c
index 0dd95c6..5ccc2e8 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
if (node == NULL)
return false;

+ planstate_tree_walker(node, ExecShutdownNode, NULL);
+
switch (nodeTag(node))
{
case T_GatherState:
@@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
break;
}

- return planstate_tree_walker(node, ExecShutdownNode, NULL);
+ return false;
}

Also, in ExecEndGather, something like this:

diff --git a/src/backend/executor/nodeGather.c
b/src/backend/executor/nodeGather.c
index a1a3561..32c97d3 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -229,10 +229,10 @@ ExecGather(GatherState *node)
void
ExecEndGather(GatherState *node)
{
+ ExecEndNode(outerPlanState(node)); /* let children clean up first */
ExecShutdownGather(node);
ExecFreeExprContext(&node->ps);
ExecClearTuple(node->ps.ps_ResultTupleSlot);
- ExecEndNode(outerPlanState(node));
}

/*

With those changes and an ExecShutdownBitmapHeapScan() called from
ExecShutdownNode(), it should be possible (I think) for us to always
have the bitmap heap scan node shut down before the Gather node shuts
down, which I think would let you avoid having a special case for this
inside the TBM code.

+ char *ptr;
+ dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer));
+ tbm->dsa_data = dsaptr;
+ ptr = dsa_get_address(tbm->dsa, dsaptr);
+ memset(ptr, 0, size + sizeof(dsa_pointer));
+ *((dsa_pointer *) ptr) = dsaptr;

Hmm. Apparently, we need a dsa_allocate_and_zero() or dsa_allocate0()
function. This pattern seems likely to come up a lot, and I don't
think we should require every caller to deal with it.

I don't see why you think you need to add sizeof(dsa_pointer) to the
allocation and store an extra copy of the dsa_pointer in the
additional space. You are already storing it in tbm->dsa_data and
that seems good enough. pagetable_free() needs the value, but it has
a pointer to the TIDBitmap and could just pass tbm->dsa_data directly
instead of fishing the pointer out of the extra space. Also, this
kind of thing is in general not very wise because it tends to cause
alignment faults on some picky machine. dsa_allocate() will return a
MAXALIGN'd pointer but ptr + sizeof(dsa_pointer) will only be
MAXALIGN'd if (sizeof(dsa_pointer) % MAXIMUM_ALIGNOF) == 0, which I
suspect might not be true everywhere and which is a shaky assumption
for the future even if it happens to be true today. For example,
consider a 32-bit platform where doubles need to be 8-byte aligned.

It seems shaky to me that tbm->iterating can be set when we've created
either a shared or a backend-private iterator. For example,
tbm_iterate() asserts that tbm->iterating is set as a way of checking
that spages and schunks will be set. But that's not guaranteed any
more with these changes, because we might've built the shared version
of the iteration arrays rather than the backend-private version.
Maybe you ought to change it from a bool to an enum:
TBM_NOT_ITERATING, TBM_ITERATING_PRIVATE, TBM_ITERATING_SHARED. And
then update all of the asserts and tests to check for the specific
state they care about.

+ while (schunkbit < PAGES_PER_CHUNK)
+ {
+ int wordnum = WORDNUM(schunkbit);
+ int bitnum = BITNUM(schunkbit);
+
+ if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
+ break;
+ schunkbit++;
+ }

How about refactoring this chunk of code into a static inline function
and having both tbm_iterate() and tbm_shared_iterate() call it?
Probably something like static inline void
tbm_advance_schunkbit(PageTableEntry *chunk, int *schunkbit).

+ /* scan bitmap to extract individual offset numbers */
+ ntuples = 0;
+ for (wordnum = 0; wordnum < WORDS_PER_PAGE; wordnum++)
+ {
+ bitmapword w = page->words[wordnum];
+
+ if (w != 0)
+ {
+ int off = wordnum * BITS_PER_BITMAPWORD + 1;
+
+ while (w != 0)
+ {
+ if (w & 1)
+ output->offsets[ntuples++] = (OffsetNumber) off;
+ off++;
+ w >>= 1;
+ }
+ }
+ }

Similarly, this looks like it could be refactored into a shared static
inline function as well, instead of duplicating it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2017-02-14 16:49:08 Re: Add doc advice about systemd RemoveIPC
Previous Message Robert Haas 2017-02-14 14:44:11 Re: UPDATE of partition key